Arxiv Papers of Today

Generated: 2026-04-08 17:05:07 (UTC+8); arXiv release time: 2026-04-08 20:00 EDT (2026-04-09 08:00 UTC+8)

32 relevant papers today

Keyword: reinforcement learning

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

领地涂漆战：在竞争性多代理PPO中诊断与缓解故障模式

Authors: Diyansha Singh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04983
Pdf link: https://arxiv.org/pdf/2604.04983
Abstract We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for $84{,}000$ episodes achieves only $26.8\%$ win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes -- reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection -- each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from $73.5\%$ to $21.6\%$. Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near $50\%$ throughout the collapse. We propose a minimal intervention -- opponent mixing, where $20\%$ of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent -- which mitigates competitive overfitting and restores generalisation to $77.1\%$ ($\pm 12.6\%$, $10$ seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.
中文摘要 我们介绍了Territory Paint Wars，这是一个在Unity中实现的极简竞争性多智能体强化学习环境，并用它系统性地研究自我对弈下近端策略优化（PPO）的失败模式。第一个智能体训练了$84{,}000$个回合，在对称零和游戏中对阵均匀随机对手时，仅能获得$26.8\%$的胜率。通过受控消融实验，我们识别出五种实现层面的失败模式（奖励尺度失衡、终止信号缺失、无效的长时程信用分配、未归一化的观测和错误的胜负判定），每一种都对该场景中的失败起关键作用。纠正这些问题后，我们发现了一种独特的涌现性病态现象：竞争性过拟合，即共同适应的智能体保持稳定的自我对弈表现，而泛化胜率从$73.5\%$骤降至$21.6\%$。关键是，这种失败无法通过标准自我对弈指标检测：双方智能体同等地共同适应，因此自我对弈胜率在整个崩溃期间保持在$50\%$左右。我们提出一种最小干预，即对手混合：在$20\%$的训练回合中用固定的均匀随机策略替代共同适应的对手。这缓解了竞争性过拟合，并将泛化胜率恢复到$77.1\%$（$\pm 12.6\%$，$10$个随机种子），无需基于种群的训练或额外基础设施。我们开源了Territory Paint Wars，为研究竞争性MARL失败模式提供可复现的基准。
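The opponent-mixing intervention described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' Unity implementation: the function names `pick_opponent` and `random_policy`, the string labels, and the four-action example are all hypothetical. Only the 20% mixing rate and the fixed uniformly-random opponent come from the abstract.

```python
import random

def pick_opponent(rng, mix_prob=0.2):
    """Choose the opponent type for the next training episode.

    With probability mix_prob the co-adapting self-play opponent is replaced
    by a fixed uniformly-random policy (the 20% figure is from the abstract;
    everything else here is an illustrative assumption).
    """
    return "random" if rng.random() < mix_prob else "self_play"

def random_policy(rng, num_actions):
    """Fixed uniformly-random opponent: it ignores the observation entirely."""
    return rng.randrange(num_actions)

rng = random.Random(0)
schedule = [pick_opponent(rng) for _ in range(10_000)]
random_fraction = schedule.count("random") / len(schedule)
print(f"fraction of episodes against the random opponent: {random_fraction:.3f}")
```

Sampling the opponent per episode, rather than switching on a fixed schedule, keeps the mix independent of training progress. Whether the paper samples or schedules the mixed episodes is not stated in the abstract.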

Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

提升基于强化学习的流动控制的样本效率：用自适应降阶模型替代评论家（critic）

Authors: Zesheng Yao, Zhen-Hua Wan, Canjun Yang, Qingchao Xia, Mengqi Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04986
Pdf link: https://arxiv.org/pdf/2604.04986
Abstract Model-free deep reinforcement learning (DRL) methods suffer from poor sample efficiency. To overcome this limitation, this work introduces an adaptive reduced-order-model (ROM)-based reinforcement learning framework for active flow control. In contrast to conventional actor--critic architectures, the proposed approach leverages a ROM to estimate the gradient information required for controller optimization. The design of the ROM structure incorporates physical insights. The ROM integrates a linear dynamical system and a neural ordinary differential equation (NODE) for estimating the nonlinearity in the flow. The parameters of the linear component are identified via operator inference, while the NODE is trained in a data-driven manner using gradient-based optimization. During controller--environment interactions, the ROM is continuously updated with newly collected data, enabling adaptive refinement of the model. The controller is then optimized through differentiable simulation of the ROM. The proposed ROM-based DRL framework is validated on two canonical flow control problems: Blasius boundary layer flow and flow past a square cylinder. For the Blasius boundary layer, the proposed method effectively reduces to a single-episode system identification and controller optimization process, yet it yields controllers that outperform traditional linear designs and achieve performance comparable to DRL approaches with minimal data. For the flow past a square cylinder, the proposed method achieves superior drag reduction with significantly fewer exploration data compared with DRL approaches. The work addresses a key component of model-free DRL control algorithms and lays the foundation for designing more sample-efficient DRL-based active flow controllers.
中文摘要 无模型深度强化学习（DRL）方法存在样本效率较差的问题。为克服这一限制，本研究引入了一种基于自适应降阶模型（ROM）的强化学习框架，用于主动流动控制。与传统的actor-critic架构不同，所提方法利用ROM来估算控制器优化所需的梯度信息。ROM结构的设计融入了物理洞察。ROM集成了线性动力系统和神经常微分方程（NODE），后者用于估计流动中的非线性。线性分量的参数通过算子推断确定，而NODE则通过基于梯度的优化以数据驱动的方式训练。在控制器-环境交互过程中，ROM会利用新收集的数据持续更新，从而实现模型的自适应改进。随后，控制器通过对ROM的可微分仿真进行优化。所提出的基于ROM的DRL框架已在两个典型的流动控制问题上得到验证：Blasius边界层流动和方柱绕流。对于Blasius边界层，所提方法实际上简化为单回合的系统辨识和控制器优化过程，但其得到的控制器优于传统线性设计，并以极少的数据实现与DRL方法相当的性能。对于方柱绕流，所提方法与DRL方法相比，以显著更少的探索数据实现了更优的减阻效果。该工作针对无模型DRL控制算法的一个关键组成部分，并为设计样本效率更高的基于DRL的主动流动控制器奠定了基础。

Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner

Vintix II：决策预训练变换器是一款可扩展的上下文强化学习器

Authors: Andrei Polubarov, Lyubaykin Nikita, Alexander Derevyagin, Artyom Grishin, Igor Saprygin, Aleksandr Serkov, Mark Averchenko, Daniil Tikhonov, Maksim Zhdanov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Alexey Zemtsov, Vladislav Kurenkov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.05112
Pdf link: https://arxiv.org/pdf/2604.05112
Abstract Recent progress in in-context reinforcement learning (ICRL) has demonstrated its potential for training generalist agents that can acquire new tasks directly at inference. Algorithm Distillation (AD) pioneered this paradigm and was subsequently scaled to multi-domain settings, although its ability to generalize to unseen tasks remained limited. The Decision Pre-Trained Transformer (DPT) was introduced as an alternative, showing stronger in-context reinforcement learning abilities in simplified domains, but its scalability had not been established. In this work, we extend DPT to diverse multi-domain environments, applying Flow Matching as a natural training choice that preserves its interpretation as Bayesian posterior sampling. As a result, we obtain an agent trained across hundreds of diverse tasks that achieves clear gains in generalization to the held-out test set. This agent improves upon prior AD scaling and demonstrates stronger performance in both online and offline inference, reinforcing ICRL as a viable alternative to expert distillation for training generalist agents.
中文摘要 上下文强化学习（ICRL）的最新进展已证明其在训练能够直接在推理阶段获得新任务的通用智能体方面的潜力。算法蒸馏（AD）开创了这一范式，随后扩展到多域环境，尽管其泛化到看不见任务的能力仍然有限。作为替代方案引入了决策预训练变换器（DPT），在简化领域展现出更强的上下文强化学习能力，但其可扩展性尚未被确立。在本研究中，我们将DPT扩展到多域环境，将流匹配作为一种自然的训练选择，同时保持其作为贝叶斯后验采样的解释。因此，我们获得了一个在数百种不同任务中训练的代理，在推广到保留测试集方面取得了明显的提升。该代理优于以往的AD扩展，并在线上和线下推断中表现出更强的性能，进一步巩固了ICRL作为培训通用代理中专家蒸馏的可行替代方案。

Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

通过国际象棋推理：推理如何从数据通过微调和强化学习演变

Authors: Lucas Dionisopoulos, Nicklas Majamaki, Prithviraj Ammanabrolu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05134
Pdf link: https://arxiv.org/pdf/2604.05134
Abstract How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics -- metrics spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.
中文摘要 如何让语言模型在它本质上难以完成的任务中进行推理？我们通过分析一组理论启发的数据集如何影响国际象棋中的语言模型表现，研究推理在语言模型中如何演变——从监督微调（SFT）到强化学习（RL）。我们发现，微调模型以直接预测最佳走法能带来有效的强化学习和最强的下游表现——然而，强化学习步骤会引发不忠实的推理（推理与所选走法不一致）。或者，多步轨迹训练能带来类似的下游表现，且推理忠实且更稳定。我们表明，强化学习会显著促进走法质量分布的正向变化，并作为副作用降低幻觉发生率。最后，我们发现多个SFT检查点指标——涵盖评估表现、幻觉率和推理质量的指标——能够预测强化学习后模型的表现。我们发布了检查点和最终模型，以及训练数据、评估和代码，使我们能够超越国际象棋领域的领先开源推理模型，采用7B参数模型。

SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning

SenseAI：一个用于RLHF对齐金融情绪推理的人机环绕数据集

Authors: Berny Kabalisa
Subjects: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2604.05135
Pdf link: https://arxiv.org/pdf/2604.05135
Abstract We introduce SenseAI, a human-in-the-loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction signals, and real-world market outcomes, providing a structure aligned with Reinforcement Learning from Human Feedback (RLHF) paradigms. The dataset consists of 1,439 labelled data points across 40 US-listed equities and 13 financial data categories, enabling direct integration into modern LLM fine-tuning pipelines. Through analysis, we identify several systematic patterns in model behavior, including a novel failure mode we term Latent Reasoning Drift, where models introduce information not grounded in the input, as well as consistent confidence miscalibration and forward projection tendencies. These findings suggest that LLM errors in financial reasoning are not random but occur within a predictable and correctable regime, supporting the use of structured HITL data for targeted model improvement. We discuss implications for financial AI systems and highlight opportunities for applying SenseAI in model evaluation and alignment.
中文摘要 我们介绍了SenseAI，一个经过验证的人机在环（HITL）验证的金融情绪数据集，旨在捕捉模型输出及其背后的全部推理过程。与现有资源不同，SenseAI整合了推理链、信心评分、人类纠正信号和真实市场结果，提供了与人类反馈强化学习（RLHF）范式相一致的结构。该数据集包含1,439个标记数据点，涵盖40只美国上市股票和13个金融数据类别，支持直接集成到现代大型语言模型的微调流程中。通过分析，我们识别出模型行为中的若干系统性模式，包括一种我们称之为潜在推理漂移的新失效模式，即模型引入与输入无依的信息，以及持续的置信度校准错误和前瞻预测倾向。这些发现表明，金融推理中的LLM错误并非随机，而是发生在可预测且可纠正的范围内，支持利用结构化HITL数据进行针对性模型改进。我们讨论了对金融人工智能系统的影响，并强调了在模型评估和对齐中应用SenseAI的机会。

Bypassing the CSI Bottleneck: MARL-Driven Spatial Control for Reflector Arrays

绕过CSI瓶颈：MARL驱动的反射阵列空间控制

Authors: Hieu Le, Oguz Bedir, Mostafa Ibrahim, Jian Tao, Sabit Ekin
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2604.05162
Pdf link: https://arxiv.org/pdf/2604.05162
Abstract Reconfigurable Intelligent Surfaces (RIS) are pivotal for next-generation smart radio environments, yet their practical deployment is severely bottlenecked by the intractable computational overhead of Channel State Information (CSI) estimation. To bypass this fundamental physical-layer barrier, we propose an AI-native, data-driven paradigm that replaces complex channel modeling with spatial intelligence. This paper presents a fully autonomous Multi-Agent Reinforcement Learning (MARL) framework to control mechanically adjustable metallic reflector arrays. By mapping high-dimensional mechanical constraints to a reduced-order virtual focal point space, we deploy a Centralized Training with Decentralized Execution (CTDE) architecture. Using Multi-Agent Proximal Policy Optimization (MAPPO), our decentralized agents learn cooperative beam-focusing strategies relying on user coordinates, achieving CSI-free operation. High-fidelity ray-tracing simulations in dynamic non-line-of-sight (NLOS) environments demonstrate that this multi-agent approach rapidly adapts to user mobility, yielding up to a 26.86 dB enhancement over static flat reflectors and outperforming single-agent and hardware-constrained DRL baselines in both spatial selectivity and temporal stability. Crucially, the learned policies exhibit good deployment resilience, sustaining stable signal coverage even under 1.0-meter localization noise. These results validate the efficacy of MARL-driven spatial abstractions as a scalable, highly practical pathway toward AI-empowered wireless networks.
中文摘要 可重构智能表面（RIS）对于下一代智能无线电环境至关重要，但其实际部署因信道状态信息（CSI）估计的复杂计算开销而严重受限。为了绕过这一基本的物理层障碍，我们提出了一种AI原生、数据驱动的范式，用空间智能取代复杂的频道建模。本文提出了一个完全自主的多智能体强化学习（MARL）框架，用于控制机械可调的金属反射阵列。通过将高维机械约束映射到低阶虚拟焦点空间，我们部署了去中心化执行的集中训练（CTDE）架构。通过多代理近端策略优化（MAPPO），我们的去中心化代理学习基于用户坐标的协作式束聚焦策略，实现无CSI操作。动态非视距（NLOS）环境中的高保真光线追踪仿真表明，这种多智能体方法能够迅速适应用户移动性，在空间选择性和时间稳定性方面，性能比静态平面反射器提升多达26.86 dB，并且在空间选择性和时间稳定性方面均优于单智能体和硬件限制的DRL基线。关键是，所学策略表现出良好的部署韧性，即使在1.0米定位噪声下也能保持稳定的信号覆盖。这些结果验证了基于MARL驱动的空间抽象作为迈向AI赋能无线网络的可扩展且高度实用路径的有效性。

Learning to Focus: CSI-Free Hierarchical MARL for Reconfigurable Reflectors

学习对焦：针对可重构反射镜的无CSI分层MARL

Authors: Hieu Le, Mostafa Ibrahim, Oguz Bedir, Jian Tao, Sabit Ekin
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2604.05165
Pdf link: https://arxiv.org/pdf/2604.05165
Abstract Reconfigurable Intelligent Surfaces (RIS) have the potential to engineer smart radio environments for next-generation millimeter-wave (mmWave) networks. However, the prohibitive computational overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization severely hinder practical large-scale deployments. To overcome these bottlenecks, we introduce a "CSI-free" paradigm powered by a Hierarchical Multi-Agent Reinforcement Learning (HMARL) architecture to control mechanically reconfigurable reflective surfaces. By substituting pilot-based channel estimation with accessible user localization data, our framework leverages spatial intelligence for macro-scale wave propagation management. The control problem is decomposed into a two-tier neural architecture: a high-level controller executes temporally extended, discrete user-to-reflector allocations, while low-level controllers autonomously optimize continuous focal points utilizing Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) scheme. Comprehensive deterministic ray-tracing evaluations demonstrate that this hierarchical framework achieves massive RSSI improvements of up to 7.79 dB over centralized baselines. Furthermore, the system exhibits robust multi-user scalability and maintains highly resilient beam-focusing performance under practical sub-meter localization tracking errors. By eliminating CSI overhead while maintaining high-fidelity signal redirection, this work establishes a scalable and cost-effective blueprint for intelligent wireless environments.
中文摘要 可重构智能表面（RIS）有潜力为下一代毫米波（mmWave）网络设计智能无线电环境。然而，信道状态信息（CSI）估计的巨大计算开销以及集中式优化固有的维度爆炸严重阻碍了大规模的实际部署。为克服这些瓶颈，我们引入了“无CSI”范式，采用分层多智能体强化学习（HMARL）架构，以控制机械可重构的反射面。通过用可访问的用户定位数据替代基于试点的通道估计，我们的框架利用空间智能实现宏观波传播管理。控制问题被分解为两层神经架构：高级控制器执行时间扩展的离散用户至反射者分配，而低级控制器则在集中训练与去中心化执行（CTDE）方案下，利用多代理近端策略优化（MAPPO）自主优化连续焦点。全面的确定性射线追踪评估表明，该分层框架在集中基线上实现了高达 7.79 dB 的巨大 RSSI 提升。此外，系统展现出强大的多用户可扩展性，并在实际亚米定位跟踪误差下保持高度韧性的束束聚焦性能。通过消除CSI开销同时保持高保真信号重定向，这项工作为智能无线环境建立了可扩展且经济高效的蓝图。

Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

面向基于模型的强化学习的交叉拟合近端学习

Authors: Nishanth Venkatesh, Andreas A. Malikopoulos
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.05185
Pdf link: https://arxiv.org/pdf/2604.05185
Abstract Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a $K$-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
中文摘要 基于模型的强化学习在序贯决策中具有吸引力，因为它显式估计奖励模型和转移模型，并通过模拟推演支持规划。然而，在存在隐藏混杂因素的离线环境中，直接从观测数据中学习的模型可能存在偏差。这一挑战在部分可观测系统中尤为明显，其中潜在因素可能共同影响动作、奖励和未来观测。最新研究表明，这种存在混杂的部分可观测马尔可夫决策过程（POMDPs）的策略评估可以归约为估计满足条件矩约束（CMR）的奖励-发射和观测-转移桥函数。本文研究了这些桥函数的统计估计。我们将桥函数学习表述为一个CMR问题，其冗余对象由条件均值嵌入和条件密度给出。然后我们为现有的两阶段桥函数估计器开发了$K$折交叉拟合扩展。该方法保留了原始的基于桥函数的识别策略，同时比单次样本拆分更高效地利用现有数据。我们还推导出交叉拟合估计器的oracle比较器界，并将所得误差分解为由冗余估计引起的第一阶段项和由经验平均引起的第二阶段项。
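The general $K$-fold cross-fitting pattern mentioned in this abstract can be sketched as below. This is a generic illustration under strong simplifying assumptions, not the paper's bridge-function estimator: the Stage I "nuisance" is a hypothetical least-squares slope, the Stage II statistic is a mean residual, and all function names and the toy data are invented for the example.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_fit(data, k, fit_nuisance, stage_two, rng):
    """For each fold, fit the nuisance on the other k-1 folds (Stage I),
    evaluate the Stage II statistic on the held-out fold, then average."""
    folds = k_fold_indices(len(data), k, rng)
    estimates = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        nuisance = fit_nuisance(train)               # Stage I: nuisance estimation
        estimates.append(stage_two(nuisance, test))  # Stage II: empirical average
    return sum(estimates) / k

def fit_slope(train):
    """Toy nuisance: least-squares slope through the origin."""
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def mean_residual(slope, test):
    """Toy Stage II statistic: mean residual on the held-out fold."""
    return sum(y - slope * x for x, y in test) / len(test)

rng = random.Random(0)
# Toy data (an assumption for illustration): y = 2x plus small Gaussian noise.
data = []
for _ in range(200):
    x = rng.uniform(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

estimate = cross_fit(data, 5, fit_slope, mean_residual, rng)
print(f"cross-fitted mean residual: {estimate:.4f}")
```

Every sample is used once for evaluation and $K-1$ times for nuisance fitting, which is the sense in which cross-fitting uses the data more efficiently than a single sample split.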

Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem

车辆即提示：针对异构车队车辆路由问题的统一深度强化学习框架

Authors: Shihong Huang, Shengjie Wang, Lei Gao, Hong Ma, Zhanluo Zhang, Feng Zhang, Weihua Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.05195
Pdf link: https://arxiv.org/pdf/2604.05195
Abstract Unlike traditional homogeneous routing problems, the Heterogeneous Fleet Vehicle Routing Problem (HFVRP) involves heterogeneous fixed costs, variable travel costs, and capacity constraints, rendering solution quality highly sensitive to vehicle selection. Furthermore, real-world logistics applications often impose additional complex constraints, markedly increasing computational complexity. However, most existing Deep Reinforcement Learning (DRL)-based methods are restricted to homogeneous scenarios, leading to suboptimal performance when applied to HFVRP and its complex variants. To bridge this gap, we investigate HFVRP under complex constraints and develop a unified DRL framework capable of solving the problem across various variant settings. We introduce the Vehicle-as-Prompt (VaP) mechanism, which formulates the problem as a single-stage autoregressive decision process. Building on this, we propose VaP-CSMV, a framework featuring a cross-semantic encoder and a multi-view decoder that effectively addresses various problem variants and captures the complex mapping relationships between vehicle heterogeneity and customer node attributes. Extensive experimental results demonstrate that VaP-CSMV significantly outperforms existing state-of-the-art DRL-based neural solvers and achieves competitive solution quality compared to traditional heuristic solvers, while reducing inference time to mere seconds. Furthermore, the framework exhibits strong zero-shot generalization capabilities on large-scale and previously unseen problem variants, while ablation studies validate the vital contribution of each component.
中文摘要 与传统的同质路径问题不同，异构车队车辆路径问题（HFVRP）涉及异构的固定成本、可变行驶成本和容量约束，使得解的质量对车辆选择极为敏感。此外，现实物流应用常常施加额外的复杂约束，显著增加计算复杂度。然而，大多数现有基于深度强化学习（DRL）的方法仅限于同质场景，导致在应用于HFVRP及其复杂变体时性能欠佳。为弥合这一差距，我们研究复杂约束下的HFVRP，并开发了一个统一的DRL框架，能够在各种变体设置下求解该问题。我们引入了车辆即提示（Vehicle-as-Prompt，简称VaP）机制，将问题表述为单阶段自回归决策过程。在此基础上，我们提出了VaP-CSMV，这是一个具有跨语义编码器和多视角解码器的框架，能够有效处理各种问题变体，并捕捉车辆异质性与客户节点属性之间复杂的映射关系。大量实验结果表明，VaP-CSMV显著优于现有最先进的基于DRL的神经求解器，并取得了与传统启发式求解器相当的解质量，同时将推理时间缩短至仅几秒钟。此外，该框架在大规模和此前未见的问题变体上展现出强大的零样本泛化能力，消融研究则验证了每个组成部分的重要贡献。

Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification

正合我的水平：一个统一的多语言简化框架，用于熟练度感知文本简化

Authors: Jinhong Jeong, Junghun Park, Youngjae Yu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.05302
Pdf link: https://arxiv.org/pdf/2604.05302
Abstract Text simplification supports second language (L2) learning by providing comprehensible input, consistent with the Input Hypothesis. However, constructing personalized parallel corpora is costly, while existing large language model (LLM)-based readability control methods rely on pre-labeled sentence corpora and primarily target English. We propose Re-RIGHT, a unified reinforcement learning framework for adaptive multilingual text simplification without parallel corpus supervision. We first show that prompting-based lexical simplification at target proficiency levels (CEFR, JLPT, TOPIK, and HSK) performs poorly at easier levels and for non-English languages, even with state-of-the-art LLMs such as GPT-5.2 and Gemini 2.5. To address this, we collect 43K vocabulary-level data across four languages (English, Japanese, Korean, and Chinese) and train a compact 4B policy model using Re-RIGHT, which integrates three reward modules: vocabulary coverage, semantic preservation, and coherence. Compared to the stronger LLM baselines, Re-RIGHT achieves higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.
中文摘要 文本简化通过提供可理解的输入支持第二语言（L2）学习，这与输入假说相符。然而，构建个性化平行语料库成本较高，而现有基于大型语言模型（LLM）的可读性控制方法依赖预先标记的句子语料库，主要针对英语。我们提出了Re-RIGHT，这是一个统一的强化学习框架，用于自适应多语言文本简化，无需并行语料库监督。我们首先表明，基于提示的词汇简化在目标熟练度（CEFR、JLPT、TOPIK 和 HSK）中，即使在较简单的水平和非英语语言中表现较差，即使是使用了最先进的大型语言模型如 GPT-5.2 和 Gemini 2.5。为此，我们收集了四种语言（英语、日语、韩语和中文）的4.3万词汇级数据，并使用Re-RIGHT训练了一个紧凑的4B政策模型，该模型集成了三个奖励模块：词汇覆盖、语义保持和连贯性。与更强的LLM基线相比，Re-RIGHT在目标熟练度水平下实现了更高的词汇覆盖，同时保持了原创意义和流利度。

Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation

Curr-RLCER：课程强化学习以实现连贯性可解释的推荐

Authors: Xiangchen Pan, Wei Wei
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.05341
Pdf link: https://arxiv.org/pdf/2604.05341
Abstract Explainable recommendation systems (RSs) are designed to explicitly uncover the rationale of each recommendation, thereby enhancing the transparency and credibility of RSs. Previous methods often jointly predicted ratings and generated explanations, but overlooked the incoherence of such two objectives. To address this issue, we propose Curr-RLCER, a reinforcement learning framework for explanation coherent recommendation with dynamic rating alignment. It employs curriculum learning, transitioning from basic predictions (i.e., click through rating-CTR, selection-based rating) to open-ended recommendation explanation generation. In particular, the rewards of each stage are designed for progressively enhancing the stability of RSs. Furthermore, a coherence-driven reward mechanism is also proposed to enforce the coherence between generated explanations and predicted ratings, supported by a specifically designed evaluation scheme. The extensive experimental results on three explainable recommendation datasets indicate that the proposed framework is effective. Codes and datasets are available at this https URL.
中文摘要 可解释推荐系统（RS）旨在明确揭示每项推荐的理由，从而提升RS的透明度和可信度。以往的方法常常共同预测评分并生成解释，但忽视了这两种目标的不一致性。为解决这一问题，我们提出了Curr-RLCER，一种用于动态评级对齐的强化学习框架，用于解释连贯推荐。它采用课程学习，从基础预测（如点击评分-点击率、基于选择的评分）过渡到开放式推荐解释生成。特别是，每个阶段的奖励旨在逐步增强RS的稳定性。此外，还提出了一种以连贯性驱动的奖励机制，以强制生成解释与预测评分之间的一致性，并由专门设计的评估方案支持。对三个可解释推荐数据集的广泛实验结果表明该框架是有效的。代码和数据集可在该 https URL 访问。

Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters

神经辅助冲量：为基于物理的角色合成夸张动作

Authors: Zhiquan Wang, Bedrich Benes
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2604.05394
Pdf link: https://arxiv.org/pdf/2604.05394
Abstract Physics-based character animation has become a fundamental approach for synthesizing realistic, physically plausible motions. While current data-driven deep reinforcement learning (DRL) methods can synthesize complex skills, they struggle to reproduce exaggerated, stylized motions, such as instantaneous dashes or mid-air trajectory changes, which are required in animation but violate standard physical laws. The primary limitation stems from modeling the character as an underactuated floating-base system, in which internal joint torques and momentum conservation strictly govern motion. Direct attempts to enforce such motions via external wrenches often lead to training instability, as velocity discontinuities produce sparse, high-magnitude force spikes that prevent policy convergence. We propose Assistive Impulse Neural Control, a framework that reformulates external assistance in impulse space rather than force space to ensure numerical stability. We decompose the assistive signal into an analytic high-frequency component derived from Inverse Dynamics and a learned low-frequency residual correction, governed by a hybrid neural policy. We demonstrate that our method enables robust tracking of highly agile, dynamically infeasible maneuvers that were previously intractable for physics-based methods.
中文摘要 基于物理的角色动画已成为合成真实且物理上合理的动作的基本方法。虽然当前数据驱动的深度强化学习（DRL）方法可以合成复杂技能，但它们难以再现夸张、风格化的动作，如瞬间冲刺或空中轨迹变化，这些动作在动画中是必需的，却违反标准物理定律。主要局限源于将角色建模为欠驱动浮动基系统，其中内部关节力矩和动量守恒严格支配运动。通过外部力旋量（wrench）直接强制实现此类运动常常导致训练不稳定，因为速度不连续会产生稀疏且幅值很大的力尖峰，阻碍策略收敛。我们提出了Assistive Impulse Neural Control（辅助冲量神经控制），该框架在冲量空间而非力空间中重新表述外部辅助，以确保数值稳定性。我们将辅助信号分解为由逆动力学导出的解析高频分量和学习得到的低频残差修正，并由混合神经策略控制。我们证明了该方法能够稳健跟踪高度敏捷、动力学上不可行的机动动作，而这些动作此前是基于物理的方法难以处理的。
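The contrast between force space and impulse space in this abstract can be illustrated with a standard point-mass relation. This is a textbook identity offered as a sketch, not the paper's formulation; the symbols $m$, $\Delta v$, $\Delta t$, $J$, and $\bar{F}$ are assumptions introduced here.

```latex
% Assumption: a point mass m whose velocity must change by \Delta v within
% one simulation step of length \Delta t (illustrative symbols only).
J = \int_{t}^{t+\Delta t} F(\tau)\, d\tau = m\,\Delta v ,
\qquad
\bar{F} = \frac{J}{\Delta t} = \frac{m\,\Delta v}{\Delta t}
```

For a fixed velocity jump $\Delta v$, the impulse $J$ stays bounded, while the average force $\bar{F}$ grows without bound as $\Delta t \to 0$. This is consistent with the abstract's remark that velocity discontinuities produce sparse, high-magnitude force spikes.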

Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game

在Tablut上重现AlphaZero：自玩RL的非对称桌游

Authors: Tõnis Lees, Tambet Matiisen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.05476
Pdf link: https://arxiv.org/pdf/2604.05476
Abstract This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data augmentation, increasing the replay buffer size, and having the model play 25 percent of training games against randomly sampled past checkpoints. Over 100 self-play iterations, the modified model demonstrated steady improvement, achieving a BayesElo rating of 1235 relative to a randomly initialized baseline. Training metrics also showed a significant decrease in policy entropy and average remaining pieces, reflecting increasingly focused and decisive play. Ultimately, the experiments confirm that AlphaZero's self-play framework can transfer to highly asymmetric games, provided that distinct policy/value heads and robust stabilization techniques are employed.
中文摘要 本研究研究将AlphaZero强化学习算法应用于Tablut，这是一种具有不等棋子数量和不同玩家目标（国王吃子与国王逃脱）的非对称历史棋盘游戏。虽然最初的AlphaZero架构成功地利用了单一策略和值头来对称游戏，但将其应用于非对称环境时，网络必须学习两个相互冲突的评估函数，这可能阻碍学习效率和性能。为此，核心架构被修改为为每个玩家角色使用独立的策略和值头，同时保留共享的残余主干以学习共同的板面特性。在训练过程中，这种不对称结构引入了训练不稳定性，尤其是攻击方和防守方角色之间的灾难性遗忘。这些问题通过应用C4数据增强、增加回放缓冲区大小以及让模型在25%的训练游戏中与随机抽样的过去检查点进行对抗来缓解。经过100次自玩迭代，修改后的模型稳步提升，相较于随机初始化基线，贝叶斯洛评分达到1235。训练指标还显示策略熵和剩余棋子平均值显著下降，反映出游戏的聚焦和果断性。最终，实验证实了AlphaZero的自玩框架可以迁移到高度非对称的游戏中，前提是采用不同的策略/价值头和稳健的稳定技术。
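The C4 data augmentation mentioned in this abstract can be sketched as below, assuming "C4" refers to the four 90-degree rotations of the square board. The helper names and the 3x3 toy board are hypothetical and are not taken from the paper's code.

```python
def rotate90(board):
    """Rotate a square board (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]

def c4_augment(board):
    """Return the four rotations (0, 90, 180, 270 degrees) of a board."""
    rotations = [[list(row) for row in board]]
    for _ in range(3):
        rotations.append(rotate90(rotations[-1]))
    return rotations

# Toy 3x3 position (an assumption for illustration; not a real Tablut setup).
board = [
    [1, 0, 0],
    [0, 2, 0],
    [0, 0, 0],
]
augmented = c4_augment(board)
for rotated in augmented:
    print(rotated)
```

In a full training pipeline the policy targets would need to be rotated consistently with the board, so that each move label still points at the same squares. That step is omitted here.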

Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

我们能信任黑盒大型语言模型吗？通过偏差扩散和多智能体强化学习实现LLM不可信边界检测

Authors: Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.05483
Pdf link: https://arxiv.org/pdf/2604.05483
Abstract Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates multiple reinforcement learning agents to efficiently identify topics (some nodes in the KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.
中文摘要 大型语言模型（LLMs）在回答各种主题问题方面展现出了很高的能力。然而，这些模型有时会产生带有偏见、意识形态化或错误的回答，如果不清楚哪些主题值得信赖，其应用范围就会受到限制。在本研究中，我们引入了一种名为GMRL-BD的新算法，旨在识别给定LLM在特定查询约束下通过黑箱访问的不可信边界（以主题为单位）。基于源自维基百科的通用知识图谱（KG），我们的算法结合了多个强化学习代理，高效识别大型语言模型可能生成有偏见答案的主题（KG中的某些节点）。我们的实验展示了算法的高效性，只需有限查询LLM，就能检测出不可信边界。此外，我们还发布了一个新数据集，包含包括Llama2、Vicuna、Falcon、Qwen2、Gemma2和Yi-1.5在内的热门大型语言模型，并附有标签显示每个大型语言模型可能存在的偏见主题。

OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward

OmniDiagram：通过视觉询问奖励推进统一图代码生成

Authors: Haoyue Yang, Xuanle Zhao, Xuexin Liu, Feibang Jiang, Yao Zhu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05514
Pdf link: https://arxiv.org/pdf/2604.05514
Abstract The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (Viva). Unlike brittle syntax-based rules or pixel-level matching, Viva rewards the visual structure of rendered diagrams through a generative approach. Specifically, Viva actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3$^2$Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our Viva-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.
中文摘要 可编程图表生成的范式正在迅速发展，在结构化可视化中发挥着关键作用。然而，大多数现有研究局限于狭窄的任务形式和语言支持，限制了其对多种图表类型的适用性。在本研究中，我们提出了OmniDiagram，一个整合了多种图表代码语言和任务定义的统一框架。为了解决强化学习（RL）中代码逻辑与视觉保真度对齐的难题，我们引入了一种新的视觉反馈策略，名为Visual Interrogation Verifies All（Viva）。与脆弱的基于语法的规则或像素级匹配不同，Viva通过生成式方法对渲染图表的视觉结构进行奖励。具体来说，Viva主动生成有针对性的视觉提问，以审查图表的视觉保真度，并为优化提供细粒度反馈。该机制促成了自我演化的训练过程，有效免除了对人工标注的真实代码的需求。此外，我们构建了M3$^2$Diagram，这是首个大规模图表代码生成数据集，包含超过19.6万个高质量实例。实验结果证实，SFT与基于Viva的强化学习相结合，使OmniDiagram在图表代码生成基准上达到了新的最先进（SOTA）水平。

UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

UniCreative：通过无参考强化学习统一长篇逻辑与短篇灵感

Authors: Xiaolong Wei, Zerun Zhu, Simin Niu, Xingyu Zhang, Peiying Yu, Changxuan Xiao, Yuchen Li, Jicheng Yang, Zhejun Zhao, Chong Meng, Long Xia, Daiting Shi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05517
Pdf link: https://arxiv.org/pdf/2604.05517
Abstract A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose UniCreative, a unified reference-free reinforcement learning framework. We first introduce AC-GenRM, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose ACPO, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning or ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.
中文摘要 创意写作中的一个根本挑战在于调和长篇叙事中保持全局连贯性与短篇文本中保持局部表现力之间的内在张力。长上下文生成需要明确的宏观规划，而短篇创作往往需要自发且无约束的表达。然而，现有的对齐范式通常采用静态奖励信号，并高度依赖高质量的监督数据，这既昂贵又难以扩展。为此，我们提出了UniCreative，一个统一的无参考强化学习框架。我们首先引入AC-GenRM，一种自适应约束感知奖励模型，它动态合成针对具体查询的评判标准，以提供细粒度的偏好判断。利用这些信号，我们提出了ACPO，一种策略优化算法，能够在内容质量和结构范式两方面将模型与人类偏好对齐，且无需监督微调和真实参考答案。实证结果表明，AC-GenRM与专家评估高度一致，而ACPO显著提升了多种写作任务上的表现。关键的是，我们的分析揭示了一种涌现的元认知能力：模型学会自主区分需要严谨规划的任务和适合直接生成的任务，验证了我们直接对齐方法的有效性。

ActivityEditor: Learning to Synthesize Physically Valid Human Mobility

ActivityEditor：学习合成物理上有效的人类移动轨迹

Authors: Chenjie Yang, Yutian Jiang, Anqi Liang, Wei Qi, Chenyu Wu, Junbo Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05529
Pdf link: https://arxiv.org/pdf/2604.05529
Abstract Human mobility modeling is indispensable for diverse urban applications. However, existing data-driven methods often suffer from data scarcity, limiting their applicability in regions where historical trajectories are unavailable or restricted. To bridge this gap, we propose ActivityEditor, a novel dual-LLM-agent framework designed for zero-shot cross-regional trajectory generation. Our framework decomposes the complex synthesis task into two collaborative stages. Specifically, an intention-based agent leverages demographic-driven priors to generate structured human intentions and coarse activity chains, ensuring high-level socio-semantic coherence. These outputs are then refined by an editor agent, which obtains mobility trajectories through iterative revisions that enforce human mobility laws. This capability is acquired through reinforcement learning with multiple rewards grounded in real-world physical constraints, allowing the agent to internalize mobility regularities and ensure high-fidelity trajectory generation. Extensive experiments demonstrate that ActivityEditor achieves superior zero-shot performance when transferred across diverse urban contexts. It maintains high statistical fidelity and physical validity, providing a robust and highly generalizable solution for mobility simulation in data-scarce scenarios. Our code is available at: this https URL.
中文摘要 人类移动建模对于多种城市应用不可或缺。然而，现有的数据驱动方法常常受限于数据稀缺，在历史轨迹不可得或受限的地区难以适用。为弥合这一差距，我们提出了ActivityEditor，一个新颖的双LLM智能体框架，用于零样本跨区域轨迹生成。我们的框架将复杂的合成任务分解为两个协作阶段。具体来说，基于意图的智能体利用人口统计驱动的先验生成结构化的人类意图和粗粒度的活动链，以确保高层次的社会语义一致性。随后，编辑智能体对这些输出进行细化，通过符合人类移动规律的迭代修订得到移动轨迹。这一能力通过强化学习获得，其多重奖励基于现实世界的物理约束，使智能体能够内化移动规律，确保高保真的轨迹生成。大量实验表明，ActivityEditor在迁移到不同城市环境时取得了更优的零样本性能。它保持了较高的统计保真度和物理有效性，为数据稀缺场景下的移动模拟提供了稳健且高度可泛化的解决方案。我们的代码可在以下 https URL 获取。

SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills

SignalClaw：LLM引导的可解释交通信号控制技能进化合成

Authors: Da Lei, Feng Xiao, Lu Li, Yuzhan Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05535
Pdf link: https://arxiv.org/pdf/2604.05535
Abstract Traffic signal control (TSC) requires strategies that are both effective and interpretable for deployment, yet reinforcement learning produces opaque neural policies while program synthesis depends on restrictive domain-specific languages. We present SignalClaw, a framework that uses large language models (LLMs) as evolutionary skill generators to synthesize and refine interpretable control skills for adaptive TSC. Each skill includes rationale, selection guidance, and executable code, making policies human-inspectable and self-documenting. At each generation, evolution signals from simulation metrics such as queue percentiles, delay trends, and stagnation are translated into natural language feedback to guide improvement. SignalClaw also introduces event-driven compositional evolution: an event detector identifies emergency vehicles, transit priority, incidents, and congestion via TraCI, and a priority dispatcher selects specialized skills. Each skill is evolved independently, and a priority chain enables runtime composition without retraining. We evaluate SignalClaw on routine and event-injected SUMO scenarios against four baselines. On routine scenarios, it achieves an average delay of 7.8 to 9.2 seconds, within 3 to 10 percent of the best method, with low variance across random seeds. Under event scenarios, it yields the lowest emergency delay, 11.2 to 18.5 seconds versus 42.3 to 72.3 for MaxPressure and 78.5 to 95.3 for DQN, and the lowest transit person delay, 9.8 to 11.5 seconds versus 38.7 to 45.2 for MaxPressure. In mixed events, the dispatcher composes skills effectively while maintaining stable overall delay. The evolved skills progress from simple linear rules to conditional strategies with multi-feature interactions, while remaining fully interpretable and directly modifiable by traffic engineers.
中文摘要 交通信号控制（TSC）需要既有效又可解释的策略才能部署，然而强化学习产生的是不透明的神经网络策略，而程序合成则依赖于限制性强的领域特定语言。我们提出SignalClaw框架，利用大型语言模型（LLM）作为进化技能生成器，合成并改进用于自适应TSC的可解释控制技能。每个技能都包含设计理由、选择指导和可执行代码，使策略便于人工检查并具备自我说明能力。在每一代中，来自仿真指标（如排队百分位数、延迟趋势和停滞情况）的进化信号会被转化为自然语言反馈，以指导改进。SignalClaw还引入了事件驱动的组合式进化：事件检测器通过TraCI识别紧急车辆、公交优先、事故和拥堵，优先级调度器则选择相应的专用技能。每个技能独立进化，优先级链支持运行时组合而无需重新训练。我们在常规和注入事件的SUMO场景中，将SignalClaw与四个基线进行了比较评估。在常规场景中，它的平均延迟为7.8至9.2秒，与最佳方法的差距在3%到10%以内，且在不同随机种子间方差较低。在事件场景下，它取得了最低的紧急车辆延迟，为11.2至18.5秒，而MaxPressure为42.3至72.3秒，DQN为78.5至95.3秒；它还取得了最低的公交乘客延迟，为9.8至11.5秒，而MaxPressure为38.7至45.2秒。在混合事件中，调度器能有效组合技能，同时保持总体延迟稳定。进化出的技能从简单的线性规则逐步发展为具有多特征交互的条件策略，同时始终保持完全可解释，并可由交通工程师直接修改。
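
The priority-chain dispatch described in the SignalClaw abstract can be illustrated with a minimal Python sketch. Everything here is hypothetical: the event names, their ordering, the observation keys, and the two toy skills are assumptions made for illustration, not the paper's implementation or the TraCI API.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical event names, ordered from highest to lowest priority.
# The abstract lists these event types but does not state their order.
PRIORITY_CHAIN: Sequence[str] = ("emergency", "transit", "incident", "congestion")

# A skill maps an observation dict to a signal phase index.
Skill = Callable[[Dict[str, float]], int]


def dispatch(active_events: List[str], skills: Dict[str, Skill], default: Skill) -> Skill:
    """Return the skill for the highest-priority active event, else the default skill."""
    for event in PRIORITY_CHAIN:
        if event in active_events and event in skills:
            return skills[event]
    return default


def routine_skill(obs: Dict[str, float]) -> int:
    # Toy rule: serve the approach with the longer queue (0 = north-south, 1 = east-west).
    return 0 if obs["queue_ns"] >= obs["queue_ew"] else 1


def emergency_skill(obs: Dict[str, float]) -> int:
    # Toy rule: switch to the phase that serves the emergency vehicle.
    return int(obs["emergency_phase"])


if __name__ == "__main__":
    skills = {"emergency": emergency_skill}
    obs = {"queue_ns": 3.0, "queue_ew": 7.0, "emergency_phase": 0.0}
    chosen = dispatch(["congestion", "emergency"], skills, routine_skill)
    print(chosen.__name__, chosen(obs))
```

Because each skill is an independent function, adding a new event type in this sketch only requires registering another entry in `skills`; no existing skill has to change, which mirrors the runtime composition without retraining that the abstract describes.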

COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

COSMO-Agent：用于闭环优化、仿真和建模编排的工具增强代理

Authors: Liyuan Deng, Shujian Deng, Yongkang Chen, Yongkang Dai, Zhihang Zhong, Linyang Li, Xiao Sun, Yilei Shi, Huaxi Huang
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2604.05547
Pdf link: https://arxiv.org/pdf/2604.05547
Abstract Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.
中文摘要 迭代式工业设计与仿真优化受制于CAD-CAE语义鸿沟：即在多样且相互耦合的约束下，将仿真反馈转化为有效的几何编辑。为填补这一鸿沟，我们提出了COSMO-Agent（闭环优化、仿真与建模编排），这是一种工具增强的强化学习（RL）框架，用于教会LLM完成闭环CAD-CAE流程。具体来说，我们将CAD生成、CAE求解、结果解析和几何修订构建为一个交互式强化学习环境，LLM在其中学习编排外部工具并修订参数化几何，直到满足约束条件。为了使这种学习稳定且具备工业实用性，我们设计了一种多约束奖励，同时鼓励可行性、工具链稳健性和结构化输出的有效性。此外，我们还贡献了一个与行业对齐的数据集，涵盖25个零部件类别，包含可执行的CAD-CAE任务，以支持贴近实际的训练和评估。实验显示，COSMO-Agent训练显著提升了小型开源LLM在约束驱动设计上的能力，在可行性、效率和稳定性方面超过了大型开源模型和强大的闭源模型。

An Iterative Test-and-Repair Framework for Competitive Code Generation

一个用于竞赛代码生成的迭代测试与修复框架

Authors: Lingxiao Tang, Muyang Ye, Zhaoyang Chu, Xiaoxue Ren, Zhongxin Liu, Lingfeng Bao, He Ye
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.05560
Pdf link: https://arxiv.org/pdf/2604.05560
Abstract Large language models (LLMs) have made remarkable progress in code generation, but competitive programming remains a challenge. Recent training-based methods have improved code generation by using reinforcement learning (RL) with execution feedback. The more recent framework CURE further incorporates test generation into the training process, jointly training a Coder and a Tester within a single model. At inference time, the Coder generates many candidate programs, and the Tester generates tests from the problem description. The candidate that passes the most generated tests is selected as the final answer. However, CURE has two critical limitations. First, the Tester never reads any candidate code, so its tests often fail to expose implementation-specific bugs. Second, the Coder generates every candidate from scratch and never learns to fix a buggy program based on a failed test. To address these limitations, we propose FixAudit, which approaches competitive code generation from a new perspective: starting from a single initial candidate, it iteratively improves the candidate through a targeted test-and-repair debugging cycle. The framework trains one shared model through four stages to play two specialized roles: the Fixer, which repairs the current candidate based on a failing test, and the Auditor, which reads the candidate code to generate new tests that expose its remaining bugs. We evaluate FixAudit on three benchmarks: APPS, CodeContests, and xCodeEval. Applied to a 7B model, the framework surpasses the average performance of the larger 32B baseline within the same model family under the zero-shot setting. Compared to strong baselines built on the same 7B base model, FixAudit improves average Pass@1 by 35.1% to 36.8% and average AvgPassRatio by 7.1% to 24.5%.
中文摘要 大型语言模型（LLM）在代码生成方面取得了显著进步，但竞赛编程依然是一大挑战。近期基于训练的方法通过使用带执行反馈的强化学习（RL）改进了代码生成。较新的CURE框架进一步将测试生成纳入训练流程，在单一模型中联合训练编码器（Coder）和测试器（Tester）。在推理时，编码器生成多个候选程序，测试器则根据问题描述生成测试。通过生成测试最多的候选程序被选为最终答案。然而，CURE存在两个关键局限。首先，测试器从不读取候选代码，因此其测试常常无法暴露与具体实现相关的缺陷。其次，编码器每次都从零生成候选程序，从未学会根据失败的测试修复有缺陷的程序。为解决这些局限，我们提出了FixAudit，它从新的视角处理竞赛代码生成：从单个初始候选程序出发，通过有针对性的测试与修复调试循环迭代改进该候选程序。该框架通过四个阶段训练一个共享模型，使其承担两个专门角色：修复者（Fixer）根据失败的测试修复当前候选程序；审计者（Auditor）阅读候选代码并生成新测试，以暴露其中剩余的缺陷。我们在APPS、CodeContests和xCodeEval三个基准上评估FixAudit。应用于7B模型时，该框架在零样本设置下超过了同一模型家族中更大的32B基线的平均性能。与基于同一7B基础模型的强基线相比，FixAudit将平均Pass@1提升了35.1%至36.8%，将平均AvgPassRatio提升了7.1%至24.5%。
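
The selection rule that the abstract attributes to CURE (pick the candidate that passes the most generated tests) can be sketched in a few lines of Python. This is an illustrative sketch only: the candidates are plain Python functions and the tests are hand-written input/output pairs, whereas the real systems generate both with an LLM and execute programs in a sandbox. All names below are hypothetical.

```python
from typing import Callable, List, Tuple

# A test is an (input, expected_output) pair for a single-argument program.
Test = Tuple[int, int]
Candidate = Callable[[int], int]


def count_passed(candidate: Candidate, tests: List[Test]) -> int:
    """Count how many tests the candidate passes; a crash counts as a failure."""
    passed = 0
    for x, expected in tests:
        try:
            if candidate(x) == expected:
                passed += 1
        except Exception:
            pass
    return passed


def select_best(candidates: List[Candidate], tests: List[Test]) -> Candidate:
    """Return the candidate with the most passed tests (first one wins on ties)."""
    return max(candidates, key=lambda c: count_passed(c, tests))


def double_ok(x: int) -> int:
    return x * 2


def double_buggy(x: int) -> int:
    # Bug: negative inputs are mapped to 0.
    return x * 2 if x >= 0 else 0


# Toy task: double the input. The last test exposes the bug in double_buggy.
TESTS: List[Test] = [(1, 2), (2, 4), (-1, -2)]

if __name__ == "__main__":
    best = select_best([double_buggy, double_ok], TESTS)
    print(best.__name__, count_passed(best, TESTS))
```

In this toy setup the negative-input test is what separates the two candidates. A test suite written without looking at the code could easily omit it, which is the limitation the abstract says the Auditor role is meant to address.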

Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

通过多样性感知红队，揭示视觉-语言-行动模型中的语言脆弱性

Authors: Baoshun Tong, Haoran He, Ling Pan, Yang Liu, Liang Lin
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.05595
Pdf link: https://arxiv.org/pdf/2604.05595
Abstract Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains a critical, under-explored concern that poses a significant safety risk to real-world deployment. Red teaming, or identifying environmental scenarios that elicit catastrophic behaviors, is an important step in ensuring the safe deployment of embodied AI agents. Reinforcement learning (RL) has emerged as a promising approach in automated red teaming that aims to uncover these vulnerabilities. However, standard RL-based adversaries often suffer from severe mode collapse due to their reward-maximizing nature: they tend to converge to a narrow set of trivial or repetitive failure patterns, failing to reveal the comprehensive landscape of meaningful risks. To bridge this gap, we propose a novel Diversity-Aware Embodied Red Teaming (DAERT) framework to expose the vulnerabilities of VLAs against linguistic variations. Our design is based on evaluating a uniform policy, which is able to generate a diverse set of challenging instructions while ensuring its attack effectiveness, measured by execution failures in a physical simulator. We conduct extensive experiments across different robotic benchmarks against two state-of-the-art VLAs, $\pi_0$ and OpenVLA. Our method consistently discovers a wider range of more effective adversarial instructions that reduce the average task success rate from 93.33% to 5.85%, demonstrating a scalable approach to stress-testing VLA agents and exposing critical safety blind spots before real-world deployment.
中文摘要 视觉-语言-动作（VLA）模型在机器人操作方面取得了显著成功。然而，它们对语言细微差别的鲁棒性仍是一个关键且未被充分探讨的问题，对实际部署构成重大安全风险。红队测试，即识别会引发灾难性行为的环境场景，是确保具身人工智能体安全部署的重要一步。强化学习（RL）已成为自动化红队测试中一种有前景的方法，旨在发现这些漏洞。然而，标准的基于强化学习的攻击者由于其最大化奖励的特性，常常出现严重的模式崩溃：它们往往收敛到一小组琐碎或重复的失败模式，无法揭示有意义风险的全貌。为弥合这一差距，我们提出了一种新颖的多样性感知具身红队测试（Diversity-Aware Embodied Red Teaming，DAERT）框架，以揭示VLA在语言变化面前的脆弱性。我们的设计基于对均匀策略的评估，该策略能够生成多样且具有挑战性的指令，同时保证攻击有效性，后者以物理模拟器中的执行失败来衡量。我们在不同的机器人基准上，针对$\pi_0$和OpenVLA这两种最先进的VLA进行了广泛实验。我们的方法始终能发现范围更广、效果更强的对抗性指令，将平均任务成功率从93.33%降至5.85%，展示了一种可扩展的方法，用于对VLA智能体进行压力测试，并在实际部署前暴露关键的安全盲点。

CuraLight: Debate-Guided Data Curation for LLM-Centered Traffic Signal Control

CuraLight：面向以LLM为中心的交通信号控制的辩论引导数据筛选

Authors: Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Shengzhe Xu, Lin Zhang, Lei Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05663
Pdf link: https://arxiv.org/pdf/2604.05663
Abstract Traffic signal control (TSC) is a core component of intelligent transportation systems (ITS), aiming to reduce congestion, emissions, and travel time. Recent approaches based on reinforcement learning (RL) and large language models (LLMs) have improved adaptivity, but still suffer from limited interpretability, insufficient interaction data, and weak generalization to heterogeneous intersections. This paper proposes CuraLight, an LLM-centered framework where an RL agent assists the fine-tuning of an LLM-based traffic signal controller. The RL agent explores traffic environments and generates high-quality interaction trajectories, which are converted into prompt-response pairs for imitation fine-tuning. A multi-LLM ensemble deliberation system further evaluates candidate signal timing actions through structured debate, providing preference-aware supervision signals for training. Experiments conducted in SUMO across heterogeneous real-world networks from Jinan, Hangzhou, and Yizhuang demonstrate that CuraLight consistently outperforms state-of-the-art baselines, reducing average travel time by 5.34 percent, average queue length by 5.14 percent, and average waiting time by 7.02 percent. The results highlight the effectiveness of combining RL-assisted exploration with deliberation-based data curation for scalable and interpretable traffic signal control.
中文摘要 交通信号控制（TSC）是智能交通系统（ITS）的核心组成部分，旨在减少拥堵、排放和出行时间。基于强化学习（RL）和大型语言模型（LLM）的最新方法提高了自适应能力，但仍存在可解释性有限、交互数据不足以及对异构交叉口泛化能力较弱的问题。本文提出了CuraLight，一种以LLM为中心的框架，其中强化学习智能体协助微调基于LLM的交通信号控制器。强化学习智能体探索交通环境并生成高质量的交互轨迹，这些轨迹被转换为提示-回复对，用于模仿微调。一个多LLM集成审议系统通过结构化辩论进一步评估候选的信号配时动作，为训练提供具有偏好感知的监督信号。在来自济南、杭州和亦庄的异构真实路网上进行的SUMO实验表明，CuraLight始终优于最先进的基线，平均行程时间减少5.34%，平均排队长度减少5.14%，平均等待时间减少7.02%。结果凸显了将强化学习辅助探索与基于审议的数据筛选相结合，对于实现可扩展且可解释的交通信号控制的有效性。

Can Large Language Models Reinvent Foundational Algorithms?

大型语言模型能否重新发明基础算法？

Authors: Jian Zhao, Haoren Luo, Yu Wang, Yuhan Cao, Pingyue Sheng, Tianxing He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05716
Pdf link: https://arxiv.org/pdf/2604.05716
Abstract LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our Unlearn-and-Reinvent pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra's or Euclid's algorithm, from an LLM's pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model, Qwen3-4B-Thinking-2507, successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for the more complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention of the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that a generative verifier in the reinvention phase plays a critical role in sustaining models' reasoning strength, helping to avoid the "thought collapse" phenomenon. These findings offer insights into both the potential and current limits of LLMs' innovative thinking.
中文摘要 LLM展现出推动科学发现的强大潜力。然而，它们是否具备基础性创新的能力，仍是一个悬而未决的问题。本研究聚焦于基础性创新的一个前提：LLM能否重新发明计算机科学中的基础算法？我们的Unlearn-and-Reinvent流水线应用LLM遗忘（unlearning）技术，从LLM的预训练知识中移除某个特定的基础算法，如Dijkstra算法或欧几里得算法，然后测试模型能否在受控环境中重新发明它。为了实现有效的遗忘，我们采用了一种基于GRPO的同策略（on-policy）遗忘方法。在10个目标算法、3个强大的开放权重模型和3个提示级别上，我们的实验表明：（1）最强的模型Qwen3-4B-Thinking-2507在无提示时成功重新发明了50%的算法，在提示级别1时为70%，在提示级别2时为90%；（2）少量高层提示可以提高重新发明的成功率，但对于较复杂的算法，即使是逐步提示也会失败；（3）测试时强化学习使模型能够在提示级别2下成功重新发明Strassen算法。通过对输出轨迹的分析和消融研究，我们发现重新发明阶段的生成式验证器对维持模型的推理能力起着关键作用，有助于避免“思维崩溃”现象。这些发现为理解LLM创新思维的潜力和当前局限提供了见解。

Emergent social transmission of model-based representations without inference

基于模型的表征在无推理的情况下涌现的社会传递

Authors: Silja Keßler, Miriam Bautista-Salinero, Claudio Tennie, Charley M. Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05777
Pdf link: https://arxiv.org/pdf/2604.05777
Abstract How do people acquire rich, flexible knowledge about their environment from others despite limited cognitive capacity? Humans are often thought to rely on computationally costly mentalizing, such as inferring others' beliefs. In contrast, cultural evolution emphasizes that behavioral transmission can be supported by simple social cues. Using reinforcement learning simulations, we show how minimal social learning can indirectly transmit higher-level representations. We simulate a naïve agent searching for rewards in a reconfigurable environment, learning either alone or by observing an expert, crucially without inferring mental states. Instead, the learner heuristically selects actions or boosts value representations based on observed actions. Our results demonstrate that these cues bias the learner's experience, causing its representation to converge toward the expert's. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations. These findings show how cultural transmission can arise from simple, non-mentalizing processes exploiting asocial learning mechanisms.
中文摘要 尽管认知能力有限，人们如何从他人那里获得关于环境的丰富而灵活的知识？人们通常认为，人类依赖计算成本高昂的心智化过程，比如推断他人的信念。相比之下，文化进化理论强调行为传递可以由简单的社会线索支持。通过强化学习模拟，我们展示了最简的社会学习如何间接传递更高层次的表征。我们模拟了一个毫无经验的智能体在可重构环境中寻找奖励，它或者独自学习，或者通过观察专家学习，关键在于它并不推断心理状态。相反，学习者基于观察到的动作，以启发式方式选择动作或提升价值表征。我们的结果表明，这些线索会使学习者的经验产生偏向，使其表征向专家的表征收敛。基于模型的学习者从社会接触中获益最多，表现出更快的学习速度和更接近专家的表征。这些发现表明，文化传递可以源自利用非社会性学习机制的简单、非心智化过程。

Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

面向LLM智能体的增强步骤级转移分层强化学习

Authors: Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.05808
Pdf link: https://arxiv.org/pdf/2604.05808
Abstract Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision-making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limited scalability. In this paper, we propose STEP-HRL, a hierarchical reinforcement learning (HRL) framework that enables step-level learning by conditioning only on single-step transitions rather than full interaction histories. STEP-HRL structures tasks hierarchically, using completed subtasks to represent the global progress of the overall task. By introducing a local progress module, it also iteratively and selectively summarizes the interaction history within each subtask to produce a compact summary of local progress. Together, these components yield augmented step-level transitions for both high-level and low-level policies. Experimental results on the ScienceWorld and ALFWorld benchmarks consistently demonstrate that STEP-HRL substantially outperforms baselines in terms of performance and generalization while reducing token usage. Our code is available at this https URL.
中文摘要 大型语言模型（LLM）智能体在复杂的交互式决策任务中展现出强大能力。然而，现有的LLM智能体通常依赖越来越长的交互历史，导致计算成本高且可扩展性有限。本文提出了STEP-HRL，一种分层强化学习（HRL）框架，它仅以单步转移而非完整交互历史为条件，从而实现步骤级学习。STEP-HRL对任务进行分层组织，用已完成的子任务表示整体任务的全局进展。通过引入局部进展模块，它还迭代且有选择地总结每个子任务内的交互历史，生成紧凑的局部进展摘要。这些组件共同为高层和低层策略提供增强的步骤级转移。在ScienceWorld和ALFWorld基准上的实验结果一致表明，STEP-HRL在性能和泛化能力方面大幅优于基线，同时减少了token用量。我们的代码可在此 https URL 访问。

Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis

以负例测试作为完备性信号的形式化规约合成强化学习

Authors: Zhechong Huang, Zhao Zhang, Zeyu Sun, Huifeng Sun, Yingfei Xiong
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.05820
Pdf link: https://arxiv.org/pdf/2604.05820
Abstract The specification synthesis task aims to automatically generate specifications, together with any necessary auxiliary verification annotations, for existing programs. This task is important because such specifications serve as behavioral contracts that support modular reasoning and reusable verification across a codebase. At the same time, it remains challenging because verifier-only feedback is fundamentally incomplete: passing verification establishes soundness, but cannot distinguish weak specifications from strong ones. What is missing is a fine-grained signal for specification completeness. We present SpecRL, a reinforcement learning framework for specification synthesis in Dafny. SpecRL introduces a self-contained pipeline that generates negative tests, i.e., input-output pairs that can never be produced by the program. We use the fraction of these negative tests rejected by a candidate specification as a signal of specification completeness, which is integrated into the reward for RL training. Experiments across four model sizes show that SpecRL improves both specification strength and verification success over SFT and RL with a binary specification-strength reward, generalizes to an out-of-distribution benchmark, and remains competitive on that unseen benchmark compared to much larger general-purpose LLMs.
中文摘要 规约合成任务旨在为现有程序自动生成规约，以及任何必要的辅助验证注解。这项任务很重要，因为这些规约充当行为契约，支持模块化推理和跨代码库的可复用验证。同时，它仍然具有挑战性，因为仅依靠验证器的反馈在本质上是不完整的：通过验证可以确立可靠性（soundness），但无法区分弱规约和强规约。所缺少的是一个针对规约完备性的细粒度信号。我们提出SpecRL，一个用于Dafny规约合成的强化学习框架。SpecRL引入了一条自包含的流水线来生成负例测试，即程序永远不可能产生的输入输出对。我们将候选规约所拒绝的负例测试的比例作为规约完备性的信号，并将其整合进强化学习训练的奖励中。在四种模型规模上的实验表明，与SFT以及采用二元规约强度奖励的RL相比，SpecRL同时提升了规约强度和验证成功率，能够泛化到分布外基准，并且在该未见过的基准上与规模大得多的通用LLM相比仍具竞争力。
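
The completeness signal described in the SpecRL abstract (the fraction of negative tests that a candidate specification rejects) can be sketched as follows. This is a hedged illustration in Python rather than Dafny: specifications are modeled as plain predicates over an input/output pair, the negative tests are written by hand, and the example program (absolute value) is a hypothetical stand-in. How the paper combines this fraction with verification success in the final reward is not stated in the abstract and is not modeled here.

```python
from typing import Callable, List, Tuple

# A specification is modeled as a predicate: does it admit (input, output)?
Spec = Callable[[int, int], bool]
# A negative test is an (input, output) pair the program can never produce.
NegativeTest = Tuple[int, int]


def completeness(spec: Spec, negative_tests: List[NegativeTest]) -> float:
    """Fraction of negative tests rejected by the specification (0.0 if none given)."""
    if not negative_tests:
        return 0.0
    rejected = sum(1 for x, y in negative_tests if not spec(x, y))
    return rejected / len(negative_tests)


# Hypothetical example: the program under specification computes abs(x).
def weak_spec(x: int, y: int) -> bool:
    # Sound but weak: only says the result is non-negative.
    return y >= 0


def strong_spec(x: int, y: int) -> bool:
    # Sound and complete for abs: non-negative and equal to x or -x.
    return y >= 0 and (y == x or y == -x)


# Pairs that abs(x) can never produce.
NEGATIVE_TESTS: List[NegativeTest] = [(3, -3), (3, 4), (-2, 5), (0, 1)]

if __name__ == "__main__":
    print(completeness(weak_spec, NEGATIVE_TESTS))    # weak spec rejects few pairs
    print(completeness(strong_spec, NEGATIVE_TESTS))  # strong spec rejects all pairs
```

Both specifications in this sketch hold for every real output of `abs`, so a verifier alone could not tell them apart. The rejection fraction separates them, which is the gap in verifier-only feedback that the abstract points out.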

Precise Aggressive Aerial Maneuvers with Sensorimotor Policies

基于感觉运动策略的精准激进空中机动

Authors: Tianyue Wu, Guangtong Xu, Zihan Wang, Junxiao Lin, Tianyang Chen, Yuze Wu, Zhichao Han, Zhiyang Liu, Fei Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.05828
Pdf link: https://arxiv.org/pdf/2604.05828
Abstract Precise aggressive maneuvers with lightweight onboard sensors remain a key bottleneck in fully exploiting the maneuverability of drones. Such maneuvers are critical for expanding the systems' accessible area by navigating through narrow openings in the environment. Among the most relevant problems, a representative one is aggressive traversal through narrow gaps with quadrotors under SE(3) constraints, which requires the quadrotors to leverage a momentary tilted attitude and the asymmetry of the airframe to navigate through gaps. In this paper, we achieve such maneuvers by developing sensorimotor policies that directly map onboard vision and proprioception into low-level control commands. The policies are trained using reinforcement learning (RL) with end-to-end policy distillation in simulation. We mitigate the fundamental hardness of model-free RL's exploration on the restricted solution space with an initialization strategy leveraging trajectories generated by a model-based planner. Careful sim-to-real design allows the policy to control a quadrotor through narrow gaps with low clearances and high repeatability. For instance, the proposed method enables a quadrotor to navigate a rectangular gap with a 5 cm clearance, tilted at an orientation of up to 90 degrees, without knowledge of the gap's position or orientation. Without training on dynamic gaps, the policy can reactively servo the quadrotor to traverse a moving gap. The proposed method is also validated by training and deploying policies on challenging tracks of closely placed narrow gaps. The flexibility of the policy learning method is demonstrated by developing policies for geometrically diverse gaps, without relying on manually defined traversal poses and visual features.
中文摘要 仅凭轻量化机载传感器实现精准的激进机动，仍是充分发挥无人机机动性的关键瓶颈。这类机动能够让无人机穿越环境中的狭窄开口，对扩展系统的可达范围至关重要。在最相关的问题中，一个有代表性的问题是四旋翼在SE(3)约束下激进地穿越狭窄缝隙，这要求四旋翼利用瞬时的倾斜姿态和机体的不对称性来穿过缝隙。本文通过开发感觉运动策略，将机载视觉和本体感觉直接映射为底层控制指令，实现了这类机动。这些策略在仿真中通过强化学习（RL）和端到端策略蒸馏进行训练。我们采用一种利用基于模型的规划器所生成轨迹的初始化策略，缓解了无模型强化学习在受限解空间上探索的根本困难。精心的仿真到现实（sim-to-real）设计使该策略能够控制四旋翼以较小的间隙余量和较高的可重复性穿越狭窄缝隙。例如，所提方法使四旋翼能够在仅有5厘米余量的情况下穿越一个倾斜角度最高达90度的矩形缝隙，且无需知道该缝隙的位置或朝向。在未针对动态缝隙训练的情况下，该策略仍能以反应式伺服的方式控制四旋翼穿越移动的缝隙。我们还在由紧密排布的狭窄缝隙构成的高难度赛道上训练和部署策略，进一步验证了该方法。通过为几何形状各异的缝隙开发策略，且不依赖人工定义的穿越位姿和视觉特征，我们展示了该策略学习方法的灵活性。

AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

AgentGL：通过强化学习迈向基于LLM的智能体式图学习

Authors: Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.05846
Pdf link: https://arxiv.org/pdf/2604.05846
Abstract Large Language Models (LLMs) increasingly rely on agentic capabilities (iterative retrieval, tool use, and decision-making) to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real-world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL. AgentGL equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning without step-wise supervision. Across diverse Text-Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at this https URL.
中文摘要 大型语言模型（LLM）越来越依赖智能体能力（迭代检索、工具使用和决策）来克服静态参数化知识的局限。然而，现有的智能体框架将外部信息视为非结构化文本，未能利用现实世界数据中固有的拓扑依赖关系。为弥合这一差距，我们引入了智能体式图学习（AGL），这一范式将图学习重新表述为拓扑感知导航与基于LLM的推理相互交错的过程。具体来说，我们提出了AgentGL，这是首个由强化学习（RL）驱动的AGL框架。AgentGL为LLM智能体配备了图原生工具以进行多尺度探索，通过搜索受限的思考来调节工具使用，以平衡准确性与效率，并采用图条件课程强化学习策略，在没有逐步监督的情况下稳定长时程策略学习。在多种文本属性图（TAG）基准和多个LLM骨干模型上，AgentGL大幅优于强大的GraphLLM和GraphRAG基线，在节点分类上取得最高17.5%的绝对提升，在链接预测上取得最高28.4%的绝对提升。这些结果表明，AGL是使LLM能够在复杂关系环境中自主导航和推理的一个有前景的前沿方向。代码在此 https URL 公开获取。

Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

具有一致性策略学习的显著性引导表征用于视觉无监督强化学习

Authors: Jingbo Sun, Qichao Zhang, Songjun Tu, Xing Fang, Yupeng Zheng, Haoran Li, Ke Chen, Dongbin Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05931
Pdf link: https://arxiv.org/pdf/2604.05931
Abstract Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.
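As background for the successor representation (SR) idea mentioned in this abstract, here is a minimal tabular sketch. It is not the SRCP method and not taken from the paper: the 3-state chain, the discount factor, and the function names are all hypothetical. It only illustrates the standard identity that, for a fixed policy with transition matrix P, the SR is M = (I - gamma P)^-1, and the value under any reward vector r is M r, so a new task can be evaluated without further learning.

```python
def successor_representation(P, gamma, iters=500):
    """Fixed-point iteration M <- I + gamma * P @ M.

    For gamma < 1 this converges to (I - gamma * P)^-1, the tabular SR.
    """
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [[identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)]
             for i in range(n)]
    return M


def values(M, reward):
    """Zero-shot evaluation: V = M @ r for any per-state reward vector r."""
    return [sum(m * r for m, r in zip(row, reward)) for row in M]


# Hypothetical 3-state chain under a fixed policy: 0 -> 1 -> 2, state 2 absorbing.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
GAMMA = 0.9

M = successor_representation(P, GAMMA)

# Two different "tasks" reuse the same M; only the reward vector changes.
v_goal_at_2 = values(M, [0.0, 0.0, 1.0])
v_goal_at_0 = values(M, [1.0, 0.0, 0.0])
```

The same M serves both reward vectors, which is the property that makes SR attractive for zero-shot transfer. In high-dimensional visual settings the table is replaced by learned features, which is where the limitations described in the abstract arise.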

MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

Authors: Maria Nesterova, Mikhail Kolosov, Anton Andreychuk, Egor Cherepanov, Oleg Bulichev, Alexey Kovalev, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.05943
Pdf link: https://arxiv.org/pdf/2604.05943
Abstract Recent advances in multi-agent reinforcement learning (MARL) have demonstrated success in numerous challenging domains and environments, but typically require specialized models for each task. In this work, we propose a coherent methodology that makes it possible for a single GPT-based model to learn and perform well across diverse MARL environments and tasks, including StarCraft Multi-Agent Challenge, Google Research Football and POGEMA. Our method, MARL-GPT, applies offline reinforcement learning to train at scale on the expert trajectories (400M for SMACv2, 100M for GRF, and 1B for POGEMA) combined with a single transformer-based observation encoder that requires no task-specific tuning. Experiments show that MARL-GPT achieves competitive performance compared to specialized baselines in all tested environments. Thus, our findings suggest that it is, indeed, possible to build a multi-task transformer-based model for a wide variety of (significantly different) multi-agent problems paving the way to the fundamental MARL model (akin to ChatGPT, Llama, Mistral etc. in natural language modeling).

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Authors: Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin, Zheng Liu, Xiaoyang Wang, Wenqiao Zhang, Lijun Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.06079
Pdf link: https://arxiv.org/pdf/2604.06079
Abstract Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Authors: Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.06156
Pdf link: https://arxiv.org/pdf/2604.06156
Abstract MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
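To make "pairwise contrastive supervision" concrete, the sketch below computes an InfoNCE-style loss over query-target embedding pairs. This is an assumption for illustration only: the abstract does not state which contrastive loss MMEmb-R1 uses, and the function names, toy vectors, and temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i],
    and every other target in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 2-D embeddings: aligned pairs versus deliberately swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss depends only on the final embeddings of each pair, not on how any intermediate reasoning text was produced. This may be one way to read the structural misalignment the abstract describes between instance-level reasoning and pairwise supervision.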

Keyword: diffusion policy

No results found.