Arxiv Papers of Today

Generated: 2026-04-21 17:45:34 (UTC+8); arXiv release time: 2026-04-21 20:00 EDT (2026-04-22 08:00 UTC+8)

80 relevant papers today

Keyword: reinforcement learning

CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

CFMS：迈向可解释且细粒度的中文多模态讽刺检测基准

Authors: Junzhao Zhang, Hsiu-Yuan Huang, Chenming Tang, Yutong Yang, Yunfang Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16372
Pdf link: https://arxiv.org/pdf/2604.16372
Abstract Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available at this https URL.
中文摘要 多模态讽刺检测最近引起了广泛关注。然而，现有基准测试存在粗粒度注释和有限的文化覆盖，阻碍了对细致语义理解的研究。为此，我们构建了CFMS，这是首个专为中国社交媒体量身定制的细粒度多模态讽刺数据集。它包含2,796对高质量的图像-文本，并提供了三层注释框架：讽刺识别、目标识别和解释生成。我们发现，细致的解释注释有效地引导人工智能生成带有明确讽刺意图的图像。此外，我们策划了一个高一致性的中英平行隐喻子集（每200条），揭示了当前模型在隐喻推理中的显著局限性。为克服传统检索方法的限制，我们提出了一种强化学习增强的上下文学习策略（PGDS），以动态优化范例选择。大量实验表明，CFMS为构建可靠的多模讽刺理解系统提供了坚实基础，PGDS方法在关键任务上显著优于现有基线。我们的数据和代码可在该 https URL 访问。

Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning

互惠协同训练（RCT）：通过强化学习耦合基于梯度的模型与不可微模型

Authors: Yunshuo Tian, Akayou Kitessa, Tanuja Chitnis, Yijun Zhao
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16378
Pdf link: https://arxiv.org/pdf/2604.16378
Abstract Large language models (LLMs) and classical machine learning methods offer complementary strengths for predictive modeling, yet their fundamentally different representations and training paradigms hinder effective integration: LLMs rely on gradient-based optimization over textual data, whereas models such as Random Forests (RF) employ non-differentiable feature partitioning. This work introduces a reciprocal co-training framework that couples an LLM with an RF classifier via reinforcement learning, creating an iterative feedback loop in which each model improves using signals from the other. Tabular data are reformulated into standardized textual representations for the LLM, whose embeddings augment the RF feature space, while calibrated RF probability estimates provide feedback signals that guide reinforcement learning updates of the LLM. Experiments across three medical datasets demonstrate consistent performance gains for both models, with particularly strong effects for the LLM. Ablation analyses show that iterative refinement, hybrid reward design, and dimensionality control jointly contribute to these gains. The proposed framework provides a general mechanism that allows incompatible model families to leverage each other's strengths through bidirectional adaptation.
中文摘要 大型语言模型（LLM）和经典机器学习方法在预测建模方面优势互补，但二者根本不同的表示方式和训练范式阻碍了有效集成：LLM依赖于在文本数据上进行基于梯度的优化，而随机森林（RF）等模型则采用不可微的特征划分。本研究引入了一种互惠协同训练框架，通过强化学习将LLM与RF分类器耦合，形成一个迭代反馈循环，使每个模型都能利用来自对方的信号进行改进。表格数据被重新表述为供LLM使用的标准化文本表示，LLM的嵌入用于扩充RF的特征空间，而经过校准的RF概率估计则提供反馈信号，指导LLM的强化学习更新。在三个医学数据集上的实验表明，两种模型均获得一致的性能提升，其中LLM的提升尤为显著。消融分析表明，迭代精炼、混合奖励设计和维度控制共同促成了这些收益。所提出的框架提供了一种通用机制，使互不兼容的模型族能够通过双向适应利用彼此的优势。

GraphRAG-Router: Learning Cost-Efficient Routing over GraphRAGs and LLMs with Reinforcement Learning

GraphRAG-Router：利用强化学习在 GraphRAG 与大型语言模型之间学习高性价比路由

Authors: Dongzhe Fan, Chuanhao Ji, Zimu Wang, Tong Chen, Qiaoyu Tan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16401
Pdf link: https://arxiv.org/pdf/2604.16401
Abstract Graph-based retrieval-augmented generation (GraphRAG) has recently emerged as a powerful paradigm for knowledge-intensive question answering, especially for tasks that require structured evidence organization and multi-hop reasoning. However, existing GraphRAG systems are typically built in a one-size-fits-all manner, relying on a fixed retrieval framework and a single, often large and costly, generator LLM for all queries. This static design limits their ability to adapt to the complexity of varying questions and often incurs unnecessary computational cost. To fill in the gap, we propose GraphRAG-Router, a cost-efficient framework that adopts a hierarchical routing strategy to coordinate heterogeneous GraphRAGs and generator LLMs. Specifically, GraphRAG-Router is first warmed up through supervised fine-tuning and then optimized with a two-stage reinforcement learning procedure, whose second stage introduces a curriculum cost-aware reward to encourage difficulty-aware and economical generator allocation. Extensive experiments on six general-domain and multi-hop QA benchmarks show that GraphRAG-Router consistently outperforms state-of-the-art baselines, reducing the overuse of large LLMs by nearly 30% while maintaining strong generalization capability.
中文摘要 基于图的检索增强生成（GraphRAG）最近成为知识密集型问答的强大范式，尤其适用于需要结构化证据组织和多跳推理的任务。然而，现有的GraphRAG系统通常采用一刀切的方式构建，依赖固定的检索框架和单一的、通常体积庞大且成本高昂的生成器LLM来处理所有查询。这种静态设计限制了它们适应不同问题复杂度的能力，且常常产生不必要的计算成本。为弥补这一空白，我们提出了GraphRAG-Router，这是一种高性价比的框架，采用分层路由策略来协调异构的GraphRAG和生成器LLM。具体来说，GraphRAG-Router首先通过监督微调进行预热，然后通过两阶段强化学习过程进行优化，其中第二阶段引入课程式的成本感知奖励，以鼓励按难度感知且经济的生成器分配。在六个通用领域和多跳问答基准上的大量实验表明，GraphRAG-Router持续优于最先进的基线，将大型LLM的过度使用减少了近30%，同时保持了强大的泛化能力。

Fuzzy Encoding-Decoding to Improve Spiking Q-Learning Performance in Autonomous Driving

利用模糊编码-解码提升自动驾驶中脉冲Q学习的性能

Authors: Aref Ghoreishee, Abhishek Mishra, Lifeng Zhou, John Walsh, Anup Das, Nagarajan Kandasamy
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16436
Pdf link: https://arxiv.org/pdf/2604.16436
Abstract This paper develops an end-to-end fuzzy encoder-decoder architecture for enhancing vision-based multi-modal deep spiking Q-networks in autonomous driving. The method addresses two core limitations of spiking reinforcement learning: information loss stemming from the conversion of dense visual inputs into sparse spike trains, and the limited representational capacity of spike-based value functions, which often yields weakly discriminative Q-value estimates. The encoder introduces trainable fuzzy membership functions to generate expressive, population-based spike representations, and the decoder uses a lightweight neural decoder to reconstruct continuous Q-values from spiking outputs. Experiments on the HighwayEnv benchmark show that the proposed architecture substantially improves decision-making accuracy and closes the performance gap between spiking and non-spiking multi-modal Q-networks. The results highlight the potential of this framework for efficient and real-time autonomous driving with spiking neural networks.
中文摘要 本文开发了一种端到端的模糊编码器-解码器架构，用于增强自动驾驶中基于视觉的多模态深度脉冲Q网络。该方法解决了脉冲强化学习的两个核心局限：将密集视觉输入转换为稀疏脉冲序列所导致的信息丢失，以及基于脉冲的价值函数表征能力有限，后者常常导致判别力较弱的Q值估计。编码器引入可训练的模糊隶属函数，生成富有表现力的基于群体的脉冲表示；解码器使用轻量级神经解码器从脉冲输出中重建连续的Q值。在HighwayEnv基准上的实验表明，所提出的架构显著提升了决策准确性，并缩小了脉冲与非脉冲多模态Q网络之间的性能差距。结果凸显了该框架在利用脉冲神经网络实现高效、实时自动驾驶方面的潜力。

Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

为质量而采样：基于序贯蒙特卡洛的免训练奖励引导LLM解码

Authors: Jelena Markovic-Voronov, Wenhui Zhu, Bo Long, Zhipeng Wang, Suyash Gupta, Kayhan Behdin, Bee-Chung Chen, Deepak Agarwal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.16453
Pdf link: https://arxiv.org/pdf/2604.16453
Abstract We introduce a principled probabilistic framework for reward-guided decoding in large language models, addressing the limitations of standard decoding methods that optimize token-level likelihood rather than sequence-level quality. Our method defines a reward-augmented target distribution over complete sequences by combining model transition probabilities with prefix-dependent reward potentials. Importantly, the approach is training-free: it leaves model weights unchanged and instead modifies the inference distribution via reward potentials, with all gains arising purely from inference-time sampling. To sample from this distribution, we develop Sequential Monte Carlo algorithms, including a computationally efficient prefix-only variant and a lookahead variant whose intermediate targets match the exact marginals of the full sequence distribution. The framework also integrates resample-move updates with Metropolis-Hastings rejuvenation and supports block-wise generation, subsuming common decoding strategies such as temperature sampling and power-tempered objectives. Empirical results across three 7B models show significant gains. On code generation (HumanEval), our method improves base performance by up to 54.9% and surpasses the strongest sampling baselines by 9.1%-15.3%. On mathematical reasoning (MATH500), it achieves gains of up to 8.8%. Notably, it reaches 87.8% on HumanEval and 78.4% on MATH500 with Qwen2.5-7B, consistently outperforming the reinforcement learning method GRPO.
中文摘要 我们为大型语言模型中的奖励引导解码引入了一个有原则的概率框架，以解决标准解码方法的局限性：这些方法优化的是词元级似然而非序列级质量。我们的方法将模型的转移概率与依赖前缀的奖励势函数相结合，在完整序列上定义了一个奖励增强的目标分布。重要的是，该方法无需训练：它保持模型权重不变，而是通过奖励势函数修改推理分布，所有收益完全来自推理时的采样。为了从该分布中采样，我们开发了序贯蒙特卡洛算法，包括一种计算高效的仅前缀变体，以及一种前瞻变体，其中间目标与完整序列分布的精确边缘分布一致。该框架还将重采样-移动更新与Metropolis-Hastings再生步骤相结合，并支持分块生成，涵盖了温度采样和幂次调温目标等常见解码策略。在三个7B模型上的实证结果显示出显著提升。在代码生成（HumanEval）上，我们的方法将基础性能最多提升54.9%，并比最强的采样基线高出9.1%-15.3%。在数学推理（MATH500）上，其提升幅度最高可达8.8%。值得注意的是，使用Qwen2.5-7B时，它在HumanEval上达到87.8%，在MATH500上达到78.4%，持续优于强化学习方法GRPO。
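The sampling idea in the abstract above can be illustrated with a generic, textbook-style Sequential Monte Carlo loop. This is only a minimal sketch and not the authors' algorithm: the two-token vocabulary, the uniform `base_probs` stand-in for an LLM, the `potential` function, and the effective-sample-size resampling rule are all hypothetical choices made for illustration, and the lookahead variant and Metropolis-Hastings rejuvenation mentioned in the abstract are omitted.

```python
import math
import random

VOCAB = ["a", "b"]  # hypothetical toy vocabulary


def base_probs(prefix):
    # Hypothetical stand-in for an LLM's next-token distribution (uniform here).
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def potential(prefix, beta=2.0):
    # Hypothetical prefix-dependent reward potential: favors the token "a".
    return math.exp(beta * prefix.count("a"))


def smc_decode(num_particles=200, length=5, seed=0):
    rng = random.Random(seed)
    particles = [[] for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for _ in range(length):
        for i, prefix in enumerate(particles):
            # Propose the next token from the unchanged base model.
            probs = base_probs(prefix)
            tok = rng.choices(list(probs), weights=list(probs.values()))[0]
            new_prefix = prefix + [tok]
            # Incremental weight: ratio of potentials of new and old prefix.
            weights[i] *= potential(new_prefix) / potential(prefix)
            particles[i] = new_prefix
        # Resample when the effective sample size drops below half.
        total = sum(weights)
        norm = [w / total for w in weights]
        ess = 1.0 / sum(w * w for w in norm)
        if ess < num_particles / 2:
            chosen = rng.choices(particles, weights=norm, k=num_particles)
            particles = [list(p) for p in chosen]
            weights = [1.0] * num_particles
    return particles, weights
```

In this toy setup the weighted particles approximate a distribution in which "a" is much more likely than under the uniform base model, even though the base model itself is never modified.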

Training Language Models for Bilateral Trade with Private Information

面向含私有信息双边贸易的语言模型训练

Authors: Dirk Bergemann, Soheil Ghili, Xinyang Hu, Chuanhao Li, Zhuoran Yang
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); General Economics (econ.GN); Theoretical Economics (econ.TH)
Arxiv link: https://arxiv.org/abs/2604.16472
Pdf link: https://arxiv.org/pdf/2604.16472
Abstract Bilateral bargaining under incomplete information provides a controlled testbed for evaluating large language model (LLM) agent capabilities. Bilateral trade demands individual rationality, strategic surplus maximization, and cooperation to realize gains from trade. We develop a structured bargaining environment where LLMs negotiate via tool calls within an event-driven simulator, separating binding offers from natural-language messages to enable automated evaluation. The environment serves two purposes: as a benchmark for frontier models and as a training environment for open-weight models via reinforcement learning. In benchmark experiments, a round-robin tournament among five frontier models (15,000 negotiations) reveals that effective strategies implement price discrimination through sequential offers. Aggressive anchoring, calibrated concession, and temporal patience correlate with the highest surplus share and deal rate. Accommodating strategies that concede quickly disable price discrimination in the buyer role, yielding the lowest surplus capture and deal completion. Stronger models scale their behavior proportionally to item value, maintaining performance across price tiers; weaker models perform well only when wide zones of possible agreement offset suboptimal strategies. In training experiments, we fine-tune Qwen3 (8B, 14B) via supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) against a fixed frontier opponent. These stages optimize competing objectives: SFT approximately doubles surplus share but reduces deal rates, while RL recovers deal rates but erodes surplus gains, reflecting the reward structure. SFT also compresses surplus variation across price tiers, which generalizes to unseen opponents, suggesting that behavioral cloning instills proportional strategies rather than memorized price points.
中文摘要 在信息不完整下进行双边谈判为评估大型语言模型（LLM）代理能力提供了受控的测试平台。双边贸易要求个人理性、战略剩余最大化和合作以实现贸易收益。我们开发了一个结构化的谈判环境，LLMs通过事件驱动模拟器中的工具调用进行协商，将绑定性要约与自然语言消息分离，实现自动评估。该环境有两个目的：作为前沿模型的基准，以及通过强化学习为开放权重模型提供训练环境。在基准测试中，五个前沿模型的循环赛（15,000个谈判）显示，有效的策略通过连续报价实现价格歧视。积极的锚定、校准的让步和时间耐心与最高的盈余份额和交易率相关。能够迅速让步的适应策略，消除买方角色中的价格歧视，从而实现最低的剩余捕获和交易完成。更强的模型会根据物品价值按比例调整行为，保持跨价格层级的性能;弱模型只有在宽广的可能一致区抵消了次优策略时表现良好。在训练实验中，我们通过监督微调（SFT）对Qwen3（8B， 14B）进行微调，随后针对固定前沿对手进行Group Relative Policy Optimization（GRPO）进行微调。这些阶段优化了相互竞争的目标：SFT大约将剩余份额翻倍但降低交易率，而RL则恢复交易率但侵蚀剩余收益，反映其奖励结构。SFT还压缩了价格层级间的剩余变异，这推广到看不见的对手，表明行为克隆培养的是比例策略，而非记忆化的价格点。

Positive-Only Drifting Policy Optimization

仅正漂移策略优化

Authors: Qi Zhang
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.16519
Pdf link: https://arxiv.org/pdf/2604.16519
Abstract In the field of online reinforcement learning (RL), traditional Gaussian policies and flow-based methods are often constrained by their unimodal expressiveness, complex gradient clipping, or stringent trust-region requirements. Moreover, they all rely on post-hoc penalization of negative samples to correct erroneous actions. This paper introduces Positive-Only Drifting Policy Optimization (PODPO), a likelihood-free and gradient-clipping-free generative approach for online RL. By leveraging the drifting model, PODPO performs policy updates via advantage-weighted local contrastive drifting. Relying solely on positive-advantage samples, it elegantly steers actions toward high-return regions while exploiting the inherent local smoothness of the generative model to enable proactive error prevention. In doing so, PODPO opens a promising new pathway for generative policy learning in online settings.
中文摘要 在线强化学习（RL）领域，传统的高斯策略和基于流的方法常常受限于其单模表达性、复杂梯度裁剪或严格的信任区域要求。此外，它们都依赖事后惩罚阴性样本来纠正错误行为。本文介绍了纯正漂移策略优化（PODPO），这是一种无似然且无梯度裁剪的在线强化学习生成方法。通过利用漂移模型，PODPO通过优势加权局部对比漂移来进行政策更新。它仅依赖正优势样本，优雅地引导动作朝向高回报区域，同时利用生成模型固有的局部平滑性，实现主动错误预防。通过这样做，PODPO为在线环境中生成式政策学习开辟了一条有前景的新途径。

S-GRPO: Unified Post-Training for Large Vision-Language Models

S-GRPO：大型视觉语言模型统一后训练

Authors: Yuming Yan, Kai Tang, Sihong Chen, Ke Xu, Dan Hu, Qun Yu, Pengfei Hu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.16557
Pdf link: https://arxiv.org/pdf/2604.16557
Abstract Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model's generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajectories but frequently encounters optimization collapse - a cold-start problem where an unaligned model fails to spontaneously sample any domain-valid trajectories in sparse-reward visual tasks. In this paper, we propose Supervised Group Relative Policy Optimization (S-GRPO), a unified post-training framework that integrates the guidance of imitation learning into the multi-trajectory exploration of preference optimization. Tailored for direct-generation visual tasks, S-GRPO introduces Conditional Ground-Truth Trajectory Injection (CGI). When a binary verifier detects a complete exploratory failure within a sampled group of trajectories, CGI injects the verified ground-truth trajectory into the candidate pool. By assigning a deterministic maximal reward to this injected anchor, S-GRPO enforces a positive signal within the group-relative advantage estimation. This mechanism reformulates the supervised learning objective as a high-advantage component of the policy gradient, compelling the model to dynamically balance between exploiting the expert trajectory and exploring novel visual concepts. Theoretical analysis and empirical results demonstrate that S-GRPO gracefully bridges the gap between SFT and RL, drastically accelerates convergence, and achieves superior domain adaptation while preserving the base model's general-purpose capabilities.
中文摘要 当前用于适应大型视觉语言模型（LVLM）的训练后方法通常分为两种范式：监督式微调（SFT）和强化学习（RL）。尽管普遍存在，但这两种方法单独应用时都存在效率低下的问题。SFT强制模型沿单一专家轨迹生成，常因分布变化导致对一般多模态能力的灾难性遗忘。相反，强化学习探索多个生成轨迹，但经常遇到优化崩溃——即未比对模型在稀疏奖励视觉任务中未能自发采样任何领域有效的轨迹的冷启动问题。本文提出监督组相对策略优化（S-GRPO），这是一个统一的训练后框架，将模仿学习的指导整合进偏好优化的多轨迹探索中。S-GRPO专为直接生成的视觉任务量身定制，引入了条件地面真实轨迹注入（CGI）。当二元验证器检测到一组采样轨迹中的完全探索性失败时，CGI会将经过验证的真实轨迹注入候选池。通过为注入锚点赋予确定性的最大奖励，S-GRPO在群体相对优势估计中强制传递正信号。该机制将监督学习目标重新表述为政策梯度中的高优势组成部分，迫使模型在利用专家路径与探索新颖视觉概念之间动态平衡。理论分析和实证结果表明，S-GRPO优雅地弥合了SFT与RL之间的鸿沟，极大加速了收敛进程，并在保持基础模型通用能力的同时实现了卓越的领域适应能力。
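The injection mechanism described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, rewards are assumed to be binary (0 or 1), and advantages are assumed to be rewards standardized within the group, as is common for GRPO-style methods.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_if_all_failed(trajectories, rewards, ground_truth, max_reward=1.0):
    """Append the ground-truth trajectory only when every sample failed.

    Assumes a binary verifier, so "complete failure" means all rewards are 0.
    """
    if any(r > 0 for r in rewards):
        return list(trajectories), list(rewards)
    return list(trajectories) + [ground_truth], list(rewards) + [max_reward]
```

With four failed samples, the plain group advantages are all zero, so the group carries no learning signal. After injection the anchor receives a positive advantage and the failed samples receive negative ones, which is the positive signal the abstract refers to.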

Agentic AI for Education: A Unified Multi-Agent Framework for Personalized Learning and Institutional Intelligence

教育中的代理人工智能：一个统一的多智能体框架，用于个性化学习和机构智能

Authors: Arya Mary K J, Deepthy K Bhaskar, Sinu T S, Binu V P
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.16566
Pdf link: https://arxiv.org/pdf/2604.16566
Abstract Agentic Artificial Intelligence (AI) represents a paradigm shift from reactive systems to proactive, autonomous decision making frameworks. Existing AI-based educational systems remain fragmented and lack multi-level integration across stakeholders. This paper proposes the Agentic Unified Student Support System (AUSS), a novel multi-agent architecture integrating student-level personalization, educator-level automation, and institutional-level intelligence. The framework leverages Large Language Models (LLMs), reinforcement learning, predictive analytics, and rule-based reasoning. Experimental results demonstrate improvements in recommendation accuracy (92.4%), grading efficiency (94.1%), and dropout prediction (F1-score: 89.5%). The proposed system enables scalable, adaptive, and intelligent educational ecosystems.
中文摘要 代理人工智能（AI）代表了从被动系统向主动自主决策框架的范式转变。现有基于人工智能的教育系统仍然支离破碎，缺乏跨利益相关者的多层次整合。本文提出了代理统一学生支持系统（AUSS），这是一种新颖的多代理架构，集成了学生层面的个性化、教育者层面的自动化和机构层面的智能。该框架利用大型语言模型（LLM）、强化学习、预测分析和基于规则的推理。实验结果显示，推荐准确率（92.4%）、评分效率（94.1%）和辍学预测（F1评分：89.5%）均有所提升。该系统实现了可扩展、适应性和智能化的教育生态系统。

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

AVRT：通过单模态教师实现视听推理迁移

Authors: Edson Araujo, Saurabhchand Bhati, M. Jehanzeb Mirza, Brian Kingsbury, Samuel Thomas, Rogerio Feris, James R. Glass, Hilde Kuehne
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2604.16617
Pdf link: https://arxiv.org/pdf/2604.16617
Abstract Recent advances in reasoning models have shown remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., to allow reasoning over audio-visual data, still remains a challenge, in part because of the limited availability of high-quality reasoning data in targeted multimodal combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision- and audio-reasoning traces via models specialized to reason over their respective modalities and merge the resulting traces with an LLM merger model. The resulting multimodal traces are used in a supervised fine-tuning (SFT) cold start to adapt the target model to audio-visual reasoning traces first, before training it in a second reinforcement learning stage on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size including OmniBench and DailyOmni for audio-visual and MMAR for audio-only reasoning, showing that cross-modal training also transfers to single-modality tasks and establishing a new training pipeline for multimodal reasoning models.
中文摘要 推理模型的最新进展显示出基于文本的领域显著进展，但将这些能力转移到多模态环境中，例如允许对视听数据进行推理，仍是一大挑战，部分原因是高质量推理数据在目标多模态组合中有限。为解决这一问题，我们引入了AVRT这一新框架，能够从单一模态教师模型生成高质量的视听推理痕迹。我们通过专门推理各自模式的模型生成独立的视觉和音频推理痕迹，并将所得痕迹与大型语言模型合并。由此产生的多模态迹被用于监督微调（SFT）冷启动，先将目标模型适应视听推理轨迹，然后在更大规模数据的第二阶段强化学习中进行训练。经过七项视听和音频基准测试的评估，我们的3B和7B参数模型在同规模模型中取得了最先进的成果，包括视听领域的OmniBench和DailyOmni，以及音频推理的MMAR，显示跨模态训练同样适用于单模态任务，并为多模态推理模型建立了新的训练流水线。

DARLING: Detection Augmented Reinforcement Learning with Non-Stationary Guarantees

DARLING：具有非平稳性保证的检测增强强化学习

Authors: Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.16684
Pdf link: https://arxiv.org/pdf/2604.16684
Abstract We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise-stationary (PS) setting, where both the reward and transition dynamics can change an arbitrary number of times. We propose Detection Augmented Reinforcement Learning (DARLING), a modular wrapper for PS-RL that applies to both tabular and linear MDPs, without knowledge of the changes. Under certain change-point separation and reachability conditions, DARLING improves the best available dynamic regret bounds in both settings and yields strong empirical performance. We further establish the first minimax lower bounds for PS-RL in tabular and linear MDPs, showing that DARLING is the first nearly optimal algorithm. Experiments on standard benchmarks demonstrate that DARLING consistently surpasses the state-of-the-art methods across diverse non-stationary scenarios.
中文摘要 我们研究在事先不了解非平稳性的情况下，非平稳有限时域回合制马尔可夫决策过程（MDP）中的无模型强化学习（RL）。我们关注分段平稳（PS）设定，其中奖励和转移动态都可以发生任意多次变化。我们提出了检测增强强化学习（DARLING），这是一种用于PS-RL的模块化封装器，适用于表格型和线性MDP，且无需了解变化情况。在一定的变点间隔和可达性条件下，DARLING在两种设定下都改进了现有最佳的动态遗憾界，并取得了强劲的实证表现。我们进一步建立了表格型和线性MDP中PS-RL的首个极小极大下界，表明DARLING是第一个近乎最优的算法。在标准基准上的实验表明，DARLING在多种非平稳场景下持续超越最先进的方法。

Autonomous Vehicle Collision Avoidance With Racing Parameterized Deep Reinforcement Learning

利用赛车参数化深度强化学习实现自动驾驶车辆碰撞避免

Authors: Shathushan Sivashangaran, Vihaan Dutta, Apoorva Khairnar, Sepideh Gohari, Azim Eskandarian
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.16702
Pdf link: https://arxiv.org/pdf/2604.16702
Abstract Road traffic accidents are a leading cause of fatalities worldwide. In the US, human error causes 94% of crashes, resulting in excess of 7,000 pedestrian fatalities and $500 billion in costs annually. Autonomous Vehicles (AVs) with emergency collision avoidance systems that operate at the limits of vehicle dynamics at a high frequency, a dual constraint of nonlinear kinodynamic accuracy and computational efficiency, further enhance safety benefits during adverse weather and cybersecurity breaches, and to evade dangerous human driving when AVs and human drivers share roads. This paper parameterizes a Deep Reinforcement Learning (DRL) collision avoidance policy Out-Of-Distribution (OOD) utilizing race car overtaking, without explicit geometric mimicry reference trajectory guidance, in simulation, with a physics-informed, simulator exploit-aware reward to encode nonlinear vehicle kinodynamics. Two policies are evaluated, a default uni-direction and a reversed heading variant that navigates in the opposite direction to other cars, which both consistently outperform a Model Predictive Control and Artificial Potential Function (MPC-APF) baseline, with zero-shot transfer to proportionally scaled hardware, across three intersection collision scenarios, at 31x fewer Floating Point Operations (FLOPS) and 64x lower inference latency. The reversed heading policy outperforms the default racing overtaking policy in head-to-head collisions by 30% and the baseline by 50%, and matches the former in side collisions, where both DRL policies evade 10% greater than numerical optimal control.
中文摘要 道路交通事故是全球死亡的主要原因之一。在美国，94%的事故由人为失误造成，导致超过7000名行人死亡，每年造成5000亿美元的费用。自动驾驶车辆（AV）配备紧急碰撞避免系统，在车辆动力学极限下高频运行，同时兼具非线性运动动力学精度和计算效率的双重约束，进一步提升恶劣天气和网络安全漏洞时的安全效益，并在自动驾驶车与人类驾驶员共用道路时避免危险的人类驾驶。本文参数化了深度强化学习（DRL）碰撞避免策略“分布外（Out-Of-Distribute，OOD）在仿真中利用赛车超车，且不明确地提供几何模仿参考轨迹引导，并以物理知情、模拟器利用感知的奖励来编码非线性车辆运动动力学。评估了两种策略，一种默认的单向行驶和一种反向方向的方向变体，后者可与其他车辆相反方向导航，这两种策略在三种交叉碰撞场景中，均持续优于模型预测控制与人工势能函数（MPC-APF）基线，实现零次切换到按比例缩放的硬件，且浮点运算次数（FLOPS）减少31倍，推断延迟降低64倍。反向驶向策略在正面碰撞中比默认竞速超车策略高出30%，基线高出50%，在侧面碰撞中与前者相当，两条DRL策略的规避都比数值最优控制多10%。

Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training

辩论即奖励：通过强化学习后训练实现科学构思的多智能体奖励系统

Authors: Moein Salimi, Babak Hosseini Mohtasham, Amin Aghakasiri, Mahdi Naieni, Amir Hossein Qeysarbeigi, Mohammad Masih Shalchian Nazer, Zahra Azar, Mahdi Jafari Siavoshani, Mohammad Hossein Rohban
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16723
Pdf link: https://arxiv.org/pdf/2604.16723
Abstract Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking -- where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high-quality scientific idea generation. We propose the first multi-agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We grounded our training in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state-of-the-art baselines across expert-evaluated metrics of novelty, feasibility, and effectiveness.
中文摘要 大型语言模型（LLMs）已展现出自动化科学构思的潜力，但目前依赖迭代提示或复杂多代理架构的方法常常存在幻觉或计算效率低下的问题。在将强化学习（RL）应用于这一开放领域时，一个关键瓶颈是奖励黑客——模型利用不完美的评估代理来最大化分数，却不产生真正的科学创新。为解决这些局限性，我们提出了一个专门为高质量科学创意生成量身定制的强化学习框架。我们提出了首个多智能体奖励函数，设计为评判，将方法论验证与实现细节解耦，同时提供严格的二元奖励，以应对奖励黑客行为。为了有效针对这种稀疏信号进行优化，我们采用了无偏的组相对策略优化（Group Relative Policy Optimization）来减轻人为长度偏差。我们的培训基础是ICLR-320，这是一个从ICLR 2024会议论文集中提取的问题-解决方案对的策划数据集。实验显示，我们的框架在专家评估的创新性、可行性和有效性指标上显著优于最先进基线。

Active World-Model with 4D-informed Retrieval for Exploration and Awareness

面向探索与感知的4D信息检索主动世界模型

Authors: Elaheh Vaezpour, Amirhosein Javadi, Tara Javidi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.16733
Pdf link: https://arxiv.org/pdf/2604.16733
Abstract Physical awareness, especially in a large and dynamic environment, is shaped by sensing decisions that determine observability across space, time, and scale, while observations impact the quality of sensing decisions. This loopy information structure makes physical awareness a fundamentally challenging decision problem with partial observations. While in the past decade we have witnessed the unprecedented success of reinforcement learning (RL) in problems with full observability, decision problems with partial observation, such as POMDPs, remain largely open: real-world explorations are excessively costly, while sim-to-real pipelines suffer from unobserved viewpoints. We introduce AW4RE (Active World-model with 4D-informed Retrieval for Exploration), an awareness-centric generative world model that provides a sensor-native surrogate environment for exploring sensing queries. Conditioned on a queried sensing action, AW4RE estimates the action-conditioned observation process. This is done by combining 4D-informed evidence retrieval, action-conditioned geometric support with temporal coherence, and conditional generative completion. Experiments demonstrate that AW4RE produces more grounded and consistent predictions than geometry-aware generative baselines under extreme viewpoint shifts, temporal gaps, and sparse geometric support.
中文摘要 物理感知，尤其是在大型且动态的环境中，受感知决策影响，这些决策决定了跨空间、时间和尺度的可观测性，而观察则影响感知决策的质量。这种复杂的信息结构使得物理意识成为一个具有部分观测的根本性挑战决策问题。虽然在过去十年里，我们见证了强化学习（RL）在完全可观测问题中的前所未有的成功，但部分观察的决策问题，如POMDPs，仍然大多未解：现实世界的探索成本过高，而模拟到现实的流水线则存在未被观察到的视角。我们介绍AW4RE（带有4D导向检索的主动世界模型），这是一种以意识为中心的生成世界模型，提供传感器原生的替代环境，用于探索感知查询。AW4RE基于查询的感应动作，估计动作条件下的观察过程。这通过结合四维证据检索、动作条件几何支持与时间连贯性以及条件生成完成来实现。实验表明，在极端视点转换、时间间隙和稀疏的几何支持下，AW4RE比几何感知生成基线能产生更扎实且一致的预测。

Privacy-Aware Machine Unlearning with SISA for Reinforcement Learning-Based Ransomware Detection

基于SISA的隐私感知机器遗忘：面向基于强化学习的勒索软件检测

Authors: Jannatul Ferdous, Rafiqul Islam, Md Zahidul Islam
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.16760
Pdf link: https://arxiv.org/pdf/2604.16760
Abstract Ransomware detection systems increasingly rely on behavior-based machine learning to address evolving attack strategies. However, emerging privacy compliance, data governance, and responsible AI deployment demand not only accurate detection but also the ability to efficiently remove the influence of specific training samples without retraining the models from scratch. In this study, we present a privacy-aware machine unlearning evaluation framework for reinforcement learning (RL)-based ransomware detection built on Sharded, Isolated, Sliced, and Aggregated (SISA) training. The framework enables efficient data deletion by retraining only the affected model shards rather than the entire detector, reducing the retraining cost while preserving detection performance. We conduct a controlled comparative study using value-based RL agents, including Deep Q-Network (DQN) and Double Deep Q-Network (DDQN), under identical experimental settings with a cost-sensitive reward design and 5-fold cross-validation on Windows 11 ransomware dataset. Detection confidence is evaluated using a continuous Q-score margin, enabling ROC-AUC analysis beyond binary predictions. For unlearning, the dataset is partitioned into five shards with majority-vote aggregation, and a fast-unlearning path is evaluated by deleting 5% of the samples from a single shard and retraining only that shard. Results show that SISA-based unlearning incurs negligible utility degradation (<= 0.05 percent F1 drop) while substantially reducing retraining time relative to full SISA retraining. DDQN exhibits slightly improved stability and lower utility loss than DQN, while both agents maintain near identical in-distribution performance after unlearning. These findings indicate that SISA provides an efficient unlearning mechanism for RL-based ransomware detection, supporting privacy-aware deployment without compromising security effectiveness.
中文摘要 勒索软件检测系统越来越依赖基于行为的机器学习来应对不断演变的攻击策略。然而，新兴的隐私合规、数据治理和负责任的AI部署要求不仅需要准确检测，还需要能够高效去除特定训练样本的影响，而无需从头重新训练模型。本研究提出了一个隐私感知的机器遗忘评估框架，面向基于强化学习（RL）的勒索软件检测，并建立在分片、隔离、切片和聚合（SISA）训练之上。该框架仅重新训练受影响的模型分片而非整个检测器，从而实现高效的数据删除，在降低重训练成本的同时保持检测性能。我们使用基于价值的强化学习智能体（包括深度Q网络（DQN）和双深度Q网络（DDQN））进行对照比较研究，在相同的实验设置下，采用成本敏感的奖励设计，并在Windows 11勒索软件数据集上进行5折交叉验证。检测置信度通过连续的Q分数间隔来评估，从而实现超越二元预测的ROC-AUC分析。在遗忘方面，数据集被划分为五个分片并采用多数投票聚合，快速遗忘路径的评估方式是从单个分片中删除5%的样本并仅重新训练该分片。结果显示，基于SISA的遗忘带来的效用下降可以忽略不计（F1下降 <= 0.05%），同时相较于完整的SISA重训练显著缩短了重训练时间。DDQN的稳定性略优于DQN，效用损失也更低，而两种智能体在遗忘后保持了几乎相同的分布内性能。这些发现表明，SISA为基于强化学习的勒索软件检测提供了高效的遗忘机制，支持隐私感知的部署，同时不影响安全效能。
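The shard-level unlearning path described in the abstract above can be sketched as follows. This is a simplified illustration rather than the paper's framework: the `SisaEnsemble` class, the round-robin sharding, and the toy one-dimensional threshold classifier (standing in for the DQN and DDQN agents) are all hypothetical, and the slicing and checkpointing part of SISA is omitted.

```python
from collections import Counter


def train_shard(samples):
    """Toy 1-D classifier: threshold at the midpoint of the two class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


class SisaEnsemble:
    def __init__(self, data, num_shards=5):
        # Round-robin sharding; each shard trains its own isolated model.
        self.shards = [data[i::num_shards] for i in range(num_shards)]
        self.models = [train_shard(s) for s in self.shards]
        self.retrain_count = 0

    def unlearn(self, sample):
        """Delete one sample and retrain only the shard that contained it."""
        for i, shard in enumerate(self.shards):
            if sample in shard:
                shard.remove(sample)
                self.models[i] = train_shard(shard)
                self.retrain_count += 1
                return i
        raise ValueError("sample not found in any shard")

    def predict(self, x):
        # Majority vote over the per-shard predictions.
        votes = Counter(int(x > t) for t in self.models)
        return votes.most_common(1)[0][0]
```

Deleting a sample touches exactly one shard model while the other four stay as trained, which is where the retraining-time saving comes from.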

A Stackelberg Game Framework with Drainability Guardrails for Pricing and Scaling in Multi-Tenant GPU Cloud Platforms

一个带有可排空性护栏的Stackelberg博弈框架，用于多租户GPU云平台的定价和扩缩容

Authors: Junji Yan, Asrin Efe Yorulmaz, Hanchen Zhou, Tamer Başar
Subjects: Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2604.16802
Pdf link: https://arxiv.org/pdf/2604.16802
Abstract Modern Graphics Processing Unit (GPU)-backed services must satisfy strict latency service-level objectives (SLOs) while controlling spare-capacity cost. In multi-tenant GPU cloud platforms, this trade-off is inherently dynamic because workload demand is endogenous; specifically, pricing shapes the submissions of heterogeneous tenants, which subsequently impact congestion and delay. We formulate the joint pricing-and-scaling problem as a large-population Stackelberg game problem, and we derive an explicit equilibrium demand map. The resulting closed-loop model reveals a structural failure mode in which delay-insensitive workloads sustain a residual demand floor, making the backlog undrainable under bounded price and service capacity. This observation motivates a computable drainability guardrail that certifies uniformly negative drift in the residual-demand regime. For any fixed price-capacity pair satisfying the drainability guardrail, we establish a unique operating point and global convergence towards it under a checkable step-size condition. Building on this fixed-pair analysis, we further develop an optimizer-agnostic action shield for the full dynamic problem and show empirically that it improves safety and robustness for model-free reinforcement learning (RL) in this setting.
中文摘要 现代由图形处理单元（GPU）支撑的服务必须满足严格的延迟服务级别目标（SLO），同时控制备用容量成本。在多租户GPU云平台中，这种权衡本质上是动态的，因为工作负载需求是内生的；具体来说，定价会影响异质租户的任务提交，进而影响拥塞和延迟。我们将联合定价与扩缩容问题表述为一个大群体Stackelberg博弈问题，并推导出显式的均衡需求映射。由此得到的闭环模型揭示了一种结构性失效模式：对延迟不敏感的工作负载会维持一个剩余需求下限，使得在价格和服务容量有界的情况下积压无法排空。这一观察促使我们提出一种可计算的可排空性护栏（drainability guardrail），用于证明在剩余需求区间内漂移一致为负。对于任何满足该护栏的固定价格-容量对，我们证明存在唯一的运行点，并在可检验的步长条件下全局收敛到该点。基于这一固定对分析，我们进一步为完整的动态问题开发了一种与优化器无关的动作屏蔽（action shield），并通过实验表明它提升了该场景下无模型强化学习（RL）的安全性和鲁棒性。

AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

AutoOR：以可扩展的方式对LLM进行后训练，使其自动形式化运筹学问题

Authors: Sumeet Ramesh Motwani, Chuan Du, Aleksander Petrov, Christopher Davis, Philip Torr, Antonio Papania-Davis, Weishi Yan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16804
Pdf link: https://arxiv.org/pdf/2604.16804
Abstract Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires specialized operations research (OR) expertise, making it hard to scale. We present AutoOR, a scalable synthetic data generation and reinforcement learning pipeline that trains LLMs to autoformalize optimization problems specified in natural language across linear, mixed-integer, and non-linear categories. AutoOR generates verified training data from standard optimization forms and uses solver execution feedback as the reward signal for RL post-training. AutoOR applied to an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks, matching significantly larger frontier models. For a non-linear problem class involving physical dynamics, where frontier models score near 0%, we introduce a curriculum RL strategy that bootstraps from limited initial training data to make this class tractable for post-training. We believe that methods such as AutoOR can significantly accelerate industrial decision-making with AI.
中文摘要 优化问题在制造、物流、调度及其他工业环境的决策中至关重要。将这些问题的复杂描述转化为可直接交给求解器的形式化表述需要专门的运筹学（OR）专业知识，因此难以规模化。我们提出AutoOR，一种可扩展的合成数据生成与强化学习流水线，用于训练LLM对以自然语言描述的优化问题进行自动形式化，涵盖线性、混合整数和非线性类别。AutoOR从标准优化形式生成经过验证的训练数据，并利用求解器执行反馈作为RL后训练的奖励信号。将AutoOR应用于8B模型，在六个已有的OR基准测试中取得了最先进或具竞争力的结果，可与规模大得多的前沿模型相匹敌。对于一类涉及物理动力学、前沿模型得分接近0%的非线性问题，我们引入了一种课程强化学习策略，从有限的初始训练数据出发进行自举，使这类问题可以通过后训练来处理。我们相信，像AutoOR这样的方法能够借助人工智能显著加速工业决策。

Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

Q-DeepSight：激励“以图思考”，用于图像质量评估与优化

Authors: Xudong Li, Jiaxi Tan, Ziyin Zhou, Yan Zhong, Zihao Huang, Jingyuan Zheng, Yan Zhang, Xiawu Zheng, Rongrong Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.16858
Pdf link: https://arxiv.org/pdf/2604.16858
Abstract Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight's diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
中文摘要 图像质量评估（IQA）模型正日益被用作感知评判器（perceptual critic），以指导生成模型和图像复原。这一角色不仅要求准确的分数，还需要可操作的、局部化的反馈。然而，当前基于MLLM的方法采用只看一次、仅依赖语言的范式，这背离了人类主动寻找证据的判断方式，给出的理由缺乏充分依据，限制了其在闭环优化中的可靠性。我们提出了Q-DeepSight，一种模拟这种类人过程的“以图思考”（think-with-image）框架。它执行交错多模态思维链（iMCoT），并借助工具增强的证据获取（如裁剪并放大），明确判断质量在何处下降以及为何下降。为了通过强化学习训练这些较长的iMCoT轨迹，我们引入了两种技术：感知课程奖励（PCR）以缓解奖励稀疏性，以及证据梯度过滤（EGF）以改进视觉依据推理的信用分配。Q-DeepSight在多种基准测试上实现了最先进的性能，涵盖自然图像、复原图像和AI生成内容。此外，我们通过Perceptual-in-Generation（PiG）展示了其实用价值：这是一个无需训练的框架，由Q-DeepSight的诊断结果指导迭代式图像增强，从而有效闭合了评估与优化之间的回路。

GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning

GRAIL：面向神经符号强化学习的自主概念落地（grounding）

Authors: Hikaru Shindo, Henri Rößler, Quentin Delfosse, Kristian Kersting
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16871
Pdf link: https://arxiv.org/pdf/2604.16871
Abstract Neuro-symbolic Reinforcement Learning (NeSy-RL) combines symbolic reasoning with gradient-based optimization to achieve interpretable and generalizable policies. Relational concepts, such as "left of" or "close by", serve as foundational building blocks that structure how agents perceive and act. However, conventional approaches require human experts to manually define these concepts, limiting adaptability since concept semantics vary across environments. We propose GRAIL (Grounding Relational Agents through Interactive Learning), a framework that autonomously grounds relational concepts through environmental interaction. GRAIL leverages large language models (LLMs) to provide generic concept representations as weak supervision, then refines them to capture environment-specific semantics. This approach addresses both sparse reward signals and concept misalignment prevalent in underdetermined environments. Experiments on the Atari games Kangaroo, Seaquest, and Skiing demonstrate that GRAIL matches or outperforms agents with manually crafted concepts in simplified settings, and reveals informative trade-offs between reward maximization and high-level goal completion in the full environment.
中文摘要 神经符号强化学习（NeSy-RL）将符号推理与基于梯度的优化相结合，以获得可解释且可泛化的策略。“在……左边”或“靠近”等关系概念是基础构件，决定了智能体如何感知和行动。然而，传统方法要求人类专家手动定义这些概念，由于概念语义因环境而异，这限制了适应性。我们提出了GRAIL（Grounding Relational Agents through Interactive Learning），一个通过环境交互自主完成关系概念落地（grounding）的框架。GRAIL利用大型语言模型（LLM）提供通用的概念表示作为弱监督，然后对其进行细化以捕捉环境特定的语义。该方法同时应对了欠定环境中普遍存在的奖励信号稀疏和概念错位问题。在Atari游戏Kangaroo、Seaquest和Skiing上的实验表明，在简化设置中，GRAIL可以匹敌甚至超过使用手工设计概念的智能体；在完整环境中，它还揭示了奖励最大化与高层目标完成之间富有信息量的权衡。

Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation

通过带可验证奖励的强化学习激励参数化知识，用于跨文化实体翻译

Authors: Jiang Zhou, Xiaohu Zhao, Xinwei Wu, Tianyu Dong, Hao Wang, Yangyang Liu, Heng Liu, Linlong Xu, Longyue Wang, Weihua Luo, Deyi Xiong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16881
Pdf link: https://arxiv.org/pdf/2604.16881
Abstract Cross-cultural entity translation remains challenging for large language models (LLMs) as literal or phonetic renderings are usually yielded instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on merely 7k samples boosts Qwen3-14B's entity translation accuracy from 23.66% to 31.87% on a 50k test set comprising entirely unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24++, which scales to +1.59 with extended optimization. Extensive analyses of pass@k dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.
中文摘要 对于大型语言模型（LLM）来说，跨文化实体翻译依然具有挑战性，因为模型通常给出直译或音译，而不是符合语境和文化的译法。然而，相关知识可能已经在大规模预训练期间编码进了模型参数。为了激励对参数化知识的有效利用，我们提出了EA-RLVR（Entity-Anchored Reinforcement Learning with Verifiable Rewards，实体锚定的可验证奖励强化学习），这是一个无需依赖外部知识库即可优化跨文化实体翻译的训练框架。EA-RLVR将监督锚定在可验证的实体级奖励信号上，并引入轻量级结构门控以稳定优化。这种设计引导模型学习稳健的推理过程，而不仅仅是模仿参考译文。我们在XC-Translate上评估EA-RLVR，观察到实体翻译准确率和域外泛化能力均有一致提升。具体来说，仅用7k个样本训练，就使Qwen3-14B在一个由完全未见实体组成的50k测试集上的实体翻译准确率从23.66%提升到31.87%。学到的实体翻译能力还能迁移到通用翻译，在WMT24++上带来+1.35的XCOMET提升，延长优化后可进一步提升至+1.59。对pass@k动态和奖励设计的大量分析将这些收益归因于更高的采样效率和稳定的优化地形。

EasyVideoR1: Easier RL for Video Understanding

EasyVideoR1：让视频理解的强化学习更简单

Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16893
Pdf link: https://arxiv.org/pdf/2604.16893
Abstract Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
中文摘要 基于可验证奖励的强化学习（RLVR）在提升大型语言模型推理能力方面表现出显著效果。随着模型向原生多模态架构演进，将RLVR扩展到视频理解变得越来越重要，但这一方向在很大程度上仍未被探索，原因包括视频任务类型多样、反复解码和预处理高维视觉输入的计算开销，以及在众多敏感超参数下难以进行可复现的评估。现有的开源强化学习训练框架为文本和图像场景提供了坚实的基础设施，但缺乏针对视频模态的系统性优化。在本研究中，我们提出了EasyVideoR1，一个完整且高效的强化学习框架，专为在视频理解任务上训练大型视觉语言模型而设计。EasyVideoR1做出了以下贡献：（1）完整的视频强化学习训练流水线，支持离线预处理和张量缓存，消除冗余的视频解码，带来1.47×的吞吐量提升；（2）一个全面的任务感知奖励系统，涵盖11种不同的视频和图像问题类型，具有统一的路由和模块化扩展；（3）离线-在线混合数据训练范式，将精选的高质量轨迹与同策略（on-policy）探索相结合，有助于学习更具挑战性的任务；（4）图像-视频联合训练，像素预算可独立配置，使两种模态相互促进；以及（5）一个涵盖22个主流视频理解基准的异步多基准评估框架，其复现的准确率与官方报告的分数高度一致。

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

面向LLM/VLM强化学习的新鲜度感知优先经验回放

Authors: Weiyu Ma, Yongcheng Zeng, Yan Song, Xinyu Cui, Jian Zhao, Xuhui Liu, Mohamed Elhoseiny
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16918
Pdf link: https://arxiv.org/pdf/2604.16918
Abstract Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at this https URL.
中文摘要 强化学习（RL）在大型语言模型（LLM）和视觉语言模型（VLM）的后训练中取得了显著成功，其中PPO、GRPO和REINFORCE++等同策略（on-policy）算法是主流范式。然而，这些方法在一次梯度更新后就会丢弃所有收集到的轨迹，导致样本效率低下，对于多轮环境交互代价高昂的智能体任务尤为浪费。在经典强化学习中，经验回放通过让智能体重用过去的轨迹并优先选择信息量大的轨迹来提升样本效率，但直接将优先经验回放（PER）应用于LLM却会失败。十亿级参数模型的策略演化很快，使得存储的优先级变得陈旧，导致旧的高优先级轨迹在早已失去信息量之后仍然主导采样。我们提出了Freshness-Aware PER（新鲜度感知PER），它以有效样本量分析为依据，为任意基于PER的优先级乘上一个指数年龄衰减因子，从而解决优先级陈旧问题。据我们所知，Freshness-Aware PER是首个成功将PER应用于LLM/VLM强化学习的工作。我们使用0.5B、3B和7B模型，在八个多步智能体、推理和数学竞赛任务上进行评估。Freshness-Aware PER显著优于同策略基线，在NQ Search上提升46%，在Sokoban上提升367%，在VLM FrozenLake上提升133%，而不带年龄衰减的标准PER则始终会降低性能。我们的代码在此 https URL 公开。
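
The abstract above describes multiplying a PER priority by an exponential age decay. A minimal Python sketch of that idea follows, assuming the decay takes the form `priority * exp(-decay_rate * age)` with age measured in policy updates. The function names, the buffer layout, and the value of `decay_rate` are illustrative assumptions; the abstract does not give the exact formula or how the rate is derived from the effective sample size analysis.

```python
import math
import random


def freshness_weight(priority, age, decay_rate):
    """Assumed form: base PER priority times an exponential age decay."""
    return priority * math.exp(-decay_rate * age)


def sample_batch(buffer, current_step, decay_rate, batch_size, rng):
    """Sample trajectories with probability proportional to decayed priority."""
    weights = [
        freshness_weight(item["priority"], current_step - item["step"], decay_rate)
        for item in buffer
    ]
    return rng.choices(buffer, weights=weights, k=batch_size)


# Hypothetical replay buffer: one stale high-priority item, two fresher ones.
buffer = [
    {"id": "old", "priority": 10.0, "step": 0},
    {"id": "mid", "priority": 2.0, "step": 40},
    {"id": "new", "priority": 1.0, "step": 50},
]
batch = sample_batch(buffer, current_step=50, decay_rate=0.1,
                     batch_size=4, rng=random.Random(0))
```

With `decay_rate=0.0` the weights reduce to plain priorities, so the stale item would dominate sampling; with a positive rate its weight falls below that of the fresh low-priority item.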

Multi-stage Planning for Multi-target Surveillance using Aircrafts Equipped with Synthetic Aperture Radars Aware of Target Visibility

使用配备合成孔径雷达的飞机、考虑目标可见性的多目标监视多阶段规划

Authors: Daniel Fuertes, Carlos R. del-Blanco, Fernando Jaureguizar, Juan José Navarro-Corcuera, Narciso García
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16962
Pdf link: https://arxiv.org/pdf/2604.16962
Abstract Generating trajectories for synthetic aperture radar (SAR)-equipped aircraft poses significant challenges due to terrain constraints, and the need for straight-flight segments to ensure high-quality imaging. Related works usually focus on trajectory optimization for predefined straight-flight segments that do not adapt to the target visibility, which depends on the 3D terrain and aircraft orientation. In addition, this assumption does not scale well for the multi-target problem, where multiple straight-flight segments that maximize target visibility must be defined for real-time operations. For this purpose, this paper presents a multi-stage planning system. First, the waypoint sequencing to visit all the targets is estimated. Second, straight-flight segments maximizing target visibility according to the 3D terrain are predicted using a novel neural network trained with deep reinforcement learning. Finally, the segments are connected to create a trajectory via optimization that imposes 3D Dubins curves. Evaluations demonstrate the robustness of the system for SAR missions since it ensures high-quality multi-target SAR image acquisition aware of 3D terrain and target visibility, and real-time performance.
中文摘要 为配备合成孔径雷达（SAR）的飞机生成轨迹面临重大挑战，原因在于地形约束，以及需要直线飞行段来保证高质量成像。相关工作通常聚焦于针对预定义直线飞行段的轨迹优化，这些飞行段无法适应目标可见性，而目标可见性取决于三维地形和飞机朝向。此外，这一假设难以扩展到多目标问题，因为在多目标问题中，必须为实时作业确定多个使目标可见性最大化的直线飞行段。为此，本文提出了一套多阶段规划系统。首先，估计访问所有目标的航点顺序。其次，使用一种经深度强化学习训练的新型神经网络，根据三维地形预测使目标可见性最大化的直线飞行段。最后，通过施加三维Dubins曲线约束的优化，将这些飞行段连接成一条轨迹。评估结果表明该系统在SAR任务中具有鲁棒性，因为它在考虑三维地形和目标可见性的前提下保证了高质量的多目标SAR图像采集，并具备实时性能。

NaviFormer: A Deep Reinforcement Learning Transformer-like Model to Holistically Solve the Navigation Problem

NaviFormer：一种整体求解导航问题的类Transformer深度强化学习模型

Authors: Daniel Fuertes, Andrea Cavallaro, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16967
Pdf link: https://arxiv.org/pdf/2604.16967
Abstract Path planning is usually solved by addressing either the (high-level) route planning problem (waypoint sequencing to achieve the final goal) or the (low-level) path planning problem (trajectory prediction between two waypoints avoiding collisions). However, real-world problems usually require simultaneous solutions to the route and path planning subproblems with a holistic and efficient approach. In this paper, we introduce NaviFormer, a deep reinforcement learning model based on a Transformer architecture that solves the global navigation problem by predicting both high-level routes and low-level trajectories. To evaluate NaviFormer, several experiments have been conducted, including comparisons with other algorithms. Results show competitive accuracy from NaviFormer since it can understand the constraints and difficulties of each subproblem and act consequently to improve performance. Moreover, its superior computation speed proves its suitability for real-time missions.
中文摘要 路径规划通常通过解决（高层次）路线规划问题（实现最终目标的航点排序）或（低层次）路径规划问题（两个航点之间的轨迹预测以避免碰撞）来解决。然而，现实问题通常需要同时对路线和路径规划子问题提出整体且高效的解决方案。本文介绍了NaviFormer，一种基于Transformer架构的深度强化学习模型，通过预测高层路径和低层轨迹来解决全局导航问题。为了评估NaviFormer，进行了多项实验，包括与其他算法的比较。结果显示，NaviFormer具有竞争力的准确性，因为它能够理解每个子问题的约束和难度，并据此采取行动提升性能。此外，其卓越的计算速度证明了其适合实时任务。

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

MCPO：大型推理模型的掌握整合策略优化

Authors: Zhaokang Liao, Yingguo Gao, Yi Yang, Yongheng Hu, Jingting Ding
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16972
Pdf link: https://arxiv.org/pdf/2604.16972
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improve the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high training efficiency. However, GRPO-style objectives exhibit two issues on high accuracy prompts including mastered prompts (rollout accuracy =1) and majority-correct prompts (rollout accuracy in (0.5,1)). For mastered prompts, group-relative advantages vanish, yielding no training signal and unconstrained policy drift that can cause forgetting. For majority-correct prompts, the induced query weight shrinks as accuracy increases, weakening consolidation from partial correctness to mastery. To alleviate this, we propose Mastery-Consolidated Policy Optimization (MCPO), which introduces (i) a hinge-KL regularizer applied exclusively to mastered prompts to bound harmful policy drift between successive gradient steps, and (ii) a weighting mechanism that prioritizes majority-correct prompts to better allocate optimization effort. Extensive experiments across three mathematical benchmarks demonstrate that MCPO consistently improves pass@1 performance. Counter-intuitively, rather than restricting exploration, MCPO boosts pass@k metrics, indicating that mastery consolidation further catalyzes solution diversity.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型（LLM）推理能力的一种有前景的方法。在RLVR算法中，组相对策略优化（GRPO）及其变体展现出强劲的性能和较高的训练效率。然而，GRPO风格的目标在高准确率提示上存在两个问题，这类提示包括已掌握提示（rollout准确率 = 1）和多数正确提示（rollout准确率在(0.5,1)区间内）。对于已掌握提示，组相对优势消失，既没有训练信号，又会出现不受约束的策略漂移，可能导致遗忘。对于多数正确提示，诱导出的查询权重随准确率提高而缩小，削弱了从部分正确到完全掌握的巩固过程。为缓解这一问题，我们提出了掌握-整合策略优化（MCPO），它引入了（i）仅作用于已掌握提示的铰链KL（hinge-KL）正则项，以限制相邻梯度步之间有害的策略漂移，以及（ii）一种优先考虑多数正确提示的加权机制，以更好地分配优化力度。在三个数学基准上的大量实验表明，MCPO能够持续提升pass@1性能。与直觉相反的是，MCPO不仅没有限制探索，反而提升了pass@k指标，表明对已掌握内容的巩固进一步促进了解法的多样性。
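
The vanishing-signal issue on mastered prompts can be seen in a small Python sketch of a group-relative advantage, assuming the commonly used form that standardizes each rollout's reward by the group mean and standard deviation. The function name and the `eps` constant are illustrative; MCPO's hinge-KL regularizer and weighting mechanism are not specified in the abstract and are not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mastered prompt: every rollout is correct (rollout accuracy = 1).
mastered = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Majority-correct prompt: rollout accuracy = 0.75.
majority_correct = group_relative_advantages([1.0, 1.0, 1.0, 0.0])
```

For the mastered prompt every advantage is zero, so the policy-gradient term contributes nothing for that prompt. For the majority-correct prompt the correct rollouts get positive advantages and the incorrect one gets a negative advantage.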

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

SPS：引导概率挤压，促进大型语言模型强化学习的更好探索

Authors: Yifu Huo, Chenglong Wang, Ziming Zhu, Shunjie Xing, Peinan Feng, Tongran Liu, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16995
Pdf link: https://arxiv.org/pdf/2604.16995
Abstract Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training typically tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, in this work, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.
中文摘要 强化学习（RL）已成为一种有前景的范式，它利用基于规则的奖励信号来训练面向推理的模型。然而，强化学习训练通常倾向于提高单样本成功率（即Pass@1），而对多样化推理轨迹的探索有限，后者对多样本性能（即Pass@k）至关重要。我们的初步分析显示，这一局限源于一种根本性的挤压效应：概率质量过度集中在一小部分高奖励轨迹上，限制了真正的探索，也制约了强化学习训练所能达到的性能。为解决这一问题，本研究提出了引导概率挤压（Steering Probability Squeezing，简称SPS），这是一种将传统强化学习与逆强化学习（IRL）交替进行的训练范式。SPS将同策略（on-policy）rollout视为示范，并利用IRL显式重塑诱导出的轨迹分布，从而在不引入外部监督的情况下增强探索。在五个常用推理基准上的实验表明，SPS能够实现更好的探索并提升Pass@k。除算法贡献外，我们还分析了强化学习的学习动态，并确定了Pass@k的一个经验上界，揭示了基于强化学习的推理模型的内在探索极限。我们的发现表明，在强化学习（RL）与逆强化学习（IRL）之间交替，是扩展面向推理的大型语言模型探索能力的有效途径。
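
For readers unfamiliar with the Pass@1 versus Pass@k distinction drawn above, here is a minimal Python sketch of the widely used unbiased pass@k estimator, which computes the chance that at least one of k samples drawn without replacement from n generated samples (c of them correct) is correct. The abstract does not say which estimator the authors use, so treating it as this one is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    all k drawn samples are incorrect.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same problem, same 10 samples with 3 correct, scored at two values of k.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

A model that concentrates probability on a few trajectories can raise `p1` while leaving little headroom for `p5` to improve, which is the squeezing effect the abstract refers to.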

Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition

小模型作为主编排器：学习带并行子任务分解的统一代理-工具编排

Authors: Wenzhen Yuan, Wutao Xiong, Fanchen Yu, Shengji Tang, Ting Liu, Tao Chen, Peng Ye, Yuzhuo Fu, Wanli Ouyang, Lei Bai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17009
Pdf link: https://arxiv.org/pdf/2604.17009
Abstract Multi-agent systems (MAS) demonstrate clear advantages in tackling complex problems by coordinating diverse agents and external tools. However, most existing orchestration methods rely on static workflows or serial agent scheduling, and are further constrained by heterogeneous interface protocols between tools and agents. This leads to high system complexity and poor extensibility. To mitigate these issues, we propose Agent-as-Tool, a unified parallel orchestration paradigm that abstracts both agents and tools into a standardized, learnable action space with protocol normalization and explicit state feedback. Building on this paradigm, we train a lightweight orchestrator, ParaManager, which decouples planning decisions from subtask solving, enabling state-aware parallel subtask decomposition, delegation, and asynchronous execution. For training, we adopt a two-stage ParaManager training pipeline. It improves robustness by incorporating supervised fine-tuning (SFT) trajectories equipped with recovery mechanisms, and further applies reinforcement learning (RL) to achieve an optimal balance among task success, protocol compliance, diversity, and reasoning efficiency. Experiments show that ParaManager achieves strong performance across multiple benchmarks and exhibits robust generalization under unseen model pools.
中文摘要 多智能体系统（MAS）通过协调多样化的智能体和外部工具，在解决复杂问题方面展现了明显优势。然而，大多数现有的编排方法依赖静态工作流或串行智能体调度，并且还受到工具与智能体之间异构接口协议的限制。这导致系统复杂度高且可扩展性差。为缓解这些问题，我们提出了Agent-as-Tool，一种统一的并行编排范式，通过协议规范化和显式状态反馈，将智能体和工具都抽象到一个标准化、可学习的动作空间中。基于这一范式，我们训练了一个轻量级编排器ParaManager，它将规划决策与子任务求解解耦，实现状态感知的并行子任务分解、委派和异步执行。在训练方面，我们采用两阶段的ParaManager训练流水线：先引入带有恢复机制的监督微调（SFT）轨迹以提升鲁棒性，再应用强化学习（RL），在任务成功率、协议合规性、多样性和推理效率之间取得最佳平衡。实验表明，ParaManager在多个基准测试中表现强劲，并在未见过的模型池下展现出稳健的泛化能力。

Web-Gewu: A Browser-Based Interactive Playground for Robot Reinforcement Learning

Web-Gewu：基于浏览器的机器人强化学习互动游乐场

Authors: Kaixuan Chen, Linqi Ye
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.17050
Pdf link: https://arxiv.org/pdf/2604.17050
Abstract With the rapid development of embodied intelligence, robotics education faces a dual challenge: high computational barriers and cumbersome environment configuration. Existing centralized cloud simulation solutions incur substantial GPU and bandwidth costs that preclude large-scale deployment, while pure local computing is severely constrained by learners' hardware limitations. To address these issues, we propose Web-Gewu (this http URL), an interactive robotics education platform built on a WebRTC cloud-edge-client collaborative architecture. The system offloads all physics simulation and reinforcement learning (RL) training to the edge node, while the cloud server acts exclusively as a lightweight signaling relay, enabling extremely low-cost browser-based peer-to-peer (P2P) real-time streaming. Learners can interact with multi-form robots at low end-to-end latency directly in a web browser without any local installation, and simultaneously observe real-time visualization of multi-dimensional monitoring data, including reinforcement learning reward curves. Combined with a predefined robust command communication protocol, Web-Gewu provides a highly scalable, out-of-the-box, and barrier-free teaching infrastructure for embodied intelligence, significantly lowering the barrier to entry for cutting-edge robotics technology.
中文摘要 随着具身智能的快速发展，机器人教育面临双重挑战：计算门槛高和环境配置繁琐。现有的集中式云仿真方案会产生大量的GPU和带宽成本，难以大规模部署，而纯本地计算则严重受限于学习者的硬件条件。为解决这些问题，我们提出了Web-Gewu（此 http URL），这是一个基于WebRTC云-边-端协同架构构建的交互式机器人教育平台。系统将所有物理仿真和强化学习（RL）训练卸载到边缘节点，而云服务器仅充当轻量级信令中继，从而实现成本极低的基于浏览器的点对点（P2P）实时流传输。学习者无需任何本地安装，即可直接在网页浏览器中以较低的端到端延迟与多种形态的机器人交互，同时实时观察多维监控数据的可视化，包括强化学习奖励曲线。结合预定义的稳健指令通信协议，Web-Gewu为具身智能提供了高度可扩展、开箱即用且无门槛的教学基础设施，显著降低了前沿机器人技术的入门门槛。

Live LTL Progress Tracking: Towards Task-Based Exploration

实时LTL进展追踪：迈向基于任务的探索

Authors: Noel Brindise, Cedric Langbort, Melkior Ornik
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.17106
Pdf link: https://arxiv.org/pdf/2604.17106
Abstract Motivated by the challenge presented by non-Markovian objectives in reinforcement learning (RL), we present a novel framework to track and represent the progress of autonomous agents through complex, multi-stage tasks. Given a specification in finite linear temporal logic (LTL), the framework establishes a 'tracking vector' which updates at each time step in a trajectory rollout. The values of the vector represent the status of the specification as the trajectory develops, assigning true, false, or 'open' labels (where 'open' is used for indeterminate cases). Applied to an LTL formula tree, the tracking vector can be used to encode detailed information about how a task is executed over a trajectory, providing a potential tool for new performance metrics, diverse exploration, and reward shaping. In this paper, we formally present the framework and algorithm, collectively named Live LTL Progress Tracking, give a simple working example, and demonstrate avenues for its integration into RL models. Future work will apply the framework to problems such as task-space exploration and diverse solution-finding in RL.
中文摘要 受强化学习（RL）中非马尔可夫目标所带来挑战的启发，我们提出了一种新框架，用于跟踪和表示自主智能体在复杂多阶段任务中的进展。给定一个用有限线性时序逻辑（LTL）表达的规范，该框架建立一个“跟踪向量”，并在轨迹展开的每个时间步更新。向量的取值表示规范随轨迹推进的状态，被赋予真、假或“开放”标签（其中“开放”用于尚不确定的情况）。将跟踪向量应用于LTL公式树时，可以编码任务在一条轨迹上如何被执行的详细信息，为新的性能指标、多样化探索和奖励塑形提供了潜在工具。本文正式给出了该框架和算法（统称为Live LTL Progress Tracking），给出一个简单的示例，并展示了将其集成到强化学习模型中的途径。未来的工作将把该框架应用于任务空间探索和强化学习中的多样化解搜索等问题。
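
To make the true / false / 'open' labelling concrete, here is a minimal Python sketch of a tracking vector for two simple temporal properties, "eventually goal" and "always safe", updated once per time step. The update rules, the two-entry vector, and the state dictionary layout are illustrative assumptions about how such tracking could work for these two operators; they are not the paper's algorithm, which operates on a full LTL formula tree.

```python
TRUE, FALSE, OPEN = "true", "false", "open"


def update_eventually(status, holds_now):
    """'Eventually p': true once p has held; otherwise still open."""
    if status == TRUE or holds_now:
        return TRUE
    return OPEN


def update_always(status, holds_now):
    """'Always p': false once p has failed; otherwise still open."""
    if status == FALSE or not holds_now:
        return FALSE
    return OPEN


def track(trajectory):
    """Return the vector [eventually goal, always safe] after each step."""
    vector = [OPEN, OPEN]
    history = []
    for state in trajectory:
        vector = [
            update_eventually(vector[0], state["goal"]),
            update_always(vector[1], state["safe"]),
        ]
        history.append(list(vector))
    return history


# Hypothetical three-step rollout.
rollout = [
    {"goal": False, "safe": True},
    {"goal": True, "safe": True},
    {"goal": False, "safe": False},
]
history = track(rollout)
```

The per-step history records when each sub-goal was resolved, which is the kind of signal the abstract suggests could feed performance metrics or reward shaping.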

Do LLM-derived graph priors improve multi-agent coordination?

LLM导出的图先验是否能改善多智能体协调？

Authors: Nikunj Gupta, Rajgopal Kannan, Viktor Prasanna
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.17191
Pdf link: https://arxiv.org/pdf/2604.17191
Abstract Multi-agent reinforcement learning (MARL) is crucial for AI systems that operate collaboratively in distributed and adversarial settings, particularly in multi-domain operations (MDO). A central challenge in cooperative MARL is determining how agents should coordinate: existing approaches must either hand-specify graph topology, rely on proximity-based heuristics, or learn structure entirely from environment interaction; all of which are brittle, semantically uninformed, or data-intensive. We investigate whether large language models (LLMs) can generate useful coordination graph priors for MARL by using minimal natural language descriptions of agent observations to infer latent coordination patterns. These priors are integrated into MARL algorithms via graph convolutional layers within a graph neural network (GNN)-based pipeline, and evaluated on four cooperative scenarios from the Multi-Agent Particle Environment (MPE) benchmark against baselines spanning the full spectrum of coordination modeling, from independent learners to state-of-the-art graph-based methods. We further ablate across five compact open-source LLMs to assess the sensitivity of prior quality to model choice. Our results provide the first quantitative evidence that LLM-derived graph priors can enhance coordination and adaptability in dynamic multi-agent environments, and demonstrate that models as small as 1.5B parameters are sufficient for effective prior generation.
中文摘要 多智能体强化学习（MARL）对于在分布式和对抗环境中协同运行的人工智能系统至关重要，尤其是在多域操作（MDO）中。协作式MARL的一个核心挑战是确定智能体之间应如何协调：现有方法要么手工指定图拓扑，要么依赖基于邻近度的启发式规则，要么完全从环境交互中学习结构；这些做法要么脆弱，要么缺乏语义信息，要么数据需求大。我们研究大型语言模型（LLM）能否利用对智能体观测的极简自然语言描述来推断潜在的协调模式，从而为MARL生成有用的协调图先验。这些先验通过基于图神经网络（GNN）的流水线中的图卷积层整合进MARL算法，并在多智能体粒子环境（MPE）基准的四个协作场景上进行评估，对比的基线覆盖了协调建模的完整谱系，从独立学习者到最先进的基于图的方法。我们进一步在五个紧凑的开源LLM上进行消融，以评估先验质量对模型选择的敏感性。我们的结果首次提供了定量证据，表明LLM导出的图先验能够提升动态多智能体环境中的协调性和适应性，并证明小至1.5B参数的模型就足以有效生成先验。
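
A minimal Python sketch of how a coordination graph prior can enter a graph convolutional layer, assuming the common symmetric-normalized form ReLU(D^{-1/2}(A + I)D^{-1/2} H W). The adjacency matrix stands in for a prior an LLM might produce; the matrix values, feature vectors, and function names are hypothetical, and the abstract does not state which graph convolution variant the authors use.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    hidden = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical prior: agents 0 and 1 should coordinate, agent 2 acts alone.
prior = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer(prior, features, identity)
```

Agents 0 and 1 end up with the same mixed representation because the prior links them, while agent 2 keeps its own features since it has no neighbours in the prior.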

Guardrails in Logit Space: Safety Token Regularization for LLM Alignment

Logit 空间中的护栏：安全令牌规范化以实现 LLM 对齐

Authors: Thong Bach, Truyen Tran
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.17210
Pdf link: https://arxiv.org/pdf/2604.17210
Abstract Fine-tuning well-aligned large language models (LLMs) on new domains often degrades their safety alignment, even when using benign datasets. Existing safety alignment techniques primarily focus on pretraining, leaving fine-tuned models vulnerable to behavioral shifts. In this work, we introduce safety token regularization (STR), a lightweight method designed to preserve safety properties during fine-tuning. Our approach identifies salient tokens from rejection templates of well-aligned models and constrains their associated logits during training, preventing the loss of critical safety behaviors. Unlike reinforcement learning or preference optimization methods, STR requires minimal additional computation and seamlessly integrates with parameter-efficient fine-tuning techniques such as LoRA. Comprehensive experiments demonstrate that our approach achieves safety performance on par with state-of-the-art methods, while preserving task-specific utility and requiring minimal implementation overhead. Furthermore, we show that safety token regularization enhances training stability and overall performance beyond safety considerations alone. This work offers a practical and readily deployable strategy for continual safety alignment in fine-tuned LLMs.
中文摘要 在新领域对高度对齐的大型语言模型（LLMs）进行微调，即使使用无害数据集，也常常会降低其安全性比对。现有的安全对齐技术主要侧重于预训练，使得经过精细调校的模型容易受到行为转变的影响。在本研究中，我们介绍了安全令牌正则化（STR），这是一种轻量级方法，旨在在微调过程中保持安全特性。我们的方法从良好对齐模型的拒绝模板中识别显著标记，并在训练过程中限制其关联的logit，防止关键安全行为的丧失。与强化学习或偏好优化方法不同，STR所需的额外计算极少，并且能够无缝集成参数高效的微调技术，如LoRA。综合实验表明，我们的方法在保持任务专用效用和最小实施开销的同时，实现了与最先进方法相当的安全性能。此外，我们证明安全令牌规范不仅能提升训练稳定性和整体表现，而不仅仅是安全因素。这项工作为微调大型语言模型中持续的安全对齐提供了实用且易于部署的策略。
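The abstract above describes constraining the logits of salient safety tokens during fine-tuning. A minimal sketch of that idea follows, assuming a squared-drift penalty against a frozen reference model; the penalty form, the function names, and the `strength` parameter are illustrative assumptions, not the authors' formulation.

```python
from typing import Sequence


def safety_token_penalty(
    logits: Sequence[float],
    ref_logits: Sequence[float],
    safety_token_ids: Sequence[int],
) -> float:
    """Mean squared drift from the reference logits, restricted to safety tokens.

    Assumption: the regularizer is a squared difference; the abstract only
    says the logits are "constrained".
    """
    if not safety_token_ids:
        return 0.0
    squared = [(logits[i] - ref_logits[i]) ** 2 for i in safety_token_ids]
    return sum(squared) / len(squared)


def regularized_loss(
    task_loss: float,
    logits: Sequence[float],
    ref_logits: Sequence[float],
    safety_token_ids: Sequence[int],
    strength: float = 0.1,  # hypothetical weighting coefficient
) -> float:
    """Task loss plus the weighted safety-token penalty."""
    penalty = safety_token_penalty(logits, ref_logits, safety_token_ids)
    return task_loss + strength * penalty
```

In this sketch only the listed token positions contribute to the penalty, so the remaining vocabulary stays free to adapt to the new domain.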

Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty

超越“我不知道”：评估大语言模型在辨别数据和模型不确定性的自我意识

Authors: Jingyi Ren, Ante Wang, Yunghwei Lai, Xiaolong Wang, Linlu Gong, Weitao Li, Weizhi Ma, Yang Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.17293
Pdf link: https://arxiv.org/pdf/2604.17293
Abstract Reliable Large Language Models (LLMs) should abstain when confidence is insufficient. However, prior studies often treat refusal as a generic "I don't know", failing to distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty). This lack of distinction limits downstream action decisions like requesting clarification or invoking external tools. In this work, we introduce UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge-intensive and reasoning-intensive tasks, designed to evaluate explicit uncertainty attribution. An evaluation of 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between data uncertainty and model uncertainty, and that high answer accuracy does not necessarily imply strong uncertainty attribution ability. To narrow this gap, we propose a lightweight data synthesis and reinforcement learning strategy. Experiments on both Qwen3-4B-Instruct-2507 and Qwen3-8B in thinking mode show that the proposed method improves uncertainty attribution while preserving answer accuracy. Our code and data are publicly available now.
中文摘要 当信心不足时，可靠的大型语言模型（LLMs）应当选择拒答。然而，以往的研究常将拒答视为一种笼统的“我不知道”，未能区分输入层面的歧义（数据不确定性）与能力限制（模型不确定性）。这种区分的缺失限制了后续的行动决策，比如请求澄清或调用外部工具。在本研究中，我们介绍了UA-Bench，这是一个由六个数据集中提取的3500多个问题组成的基准测试，涵盖知识密集型和推理密集型任务，旨在评估显式不确定性归因。对18个前沿LLMs的评估显示，即使是最先进的模型也难以可靠区分数据不确定性和模型不确定性，且高答案准确率并不一定意味着较强的不确定性归因能力。为缩小这一差距，我们提出了一种轻量级的数据合成与强化学习策略。在Qwen3-4B-Instruct-2507和思考模式下的Qwen3-8B上的实验显示，所提方法在保持答案准确性的同时，提高了不确定性归因能力。我们的代码和数据现已公开。

A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions

数据稀缺下大型语言模型强化学习概览：挑战与解决方案

Authors: Zhiyin Yu, Yuchen Mou, Juncheng Yan, Junyu Luo, Chunchun Chen, Xing Wei, Yunhui Liu, Hongru Sun, Yuxing Zhang, Jun Xu, Yatao Bian, Ming Zhang, Wei Ye, Tieke He, Jie Yang, Guanjie Zheng, Zhonghai Wu, Bo Zhang, Lei Bai, Xiao Luo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17312
Pdf link: https://arxiv.org/pdf/2604.17312
Abstract Reinforcement learning (RL) has emerged as a powerful post-training paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, reinforcement learning for LLMs faces substantial data scarcity challenges, including the limited availability of high-quality external supervision and the constrained volume of model-generated experience. These limitations make data-efficient reinforcement learning a critical research direction. In this survey, we present the first systematic review of reinforcement learning for LLMs under data scarcity. We propose a bottom-up hierarchical framework built around three complementary perspectives: the data-centric perspective, the training-centric perspective, and the framework-centric perspective. We develop a taxonomy of existing methods, summarize representative approaches in each category, and analyze their strengths and limitations. Our taxonomy aims to provide a clear conceptual foundation for understanding the design space of data-efficient RL for LLMs and to guide researchers working in this emerging area. We hope this survey offers a comprehensive roadmap for future research and inspires new directions toward more efficient and scalable reinforcement learning post-training for LLMs.
中文摘要 强化学习（RL）已成为增强大型语言模型（LLM）推理能力的强大训练后范式。然而，LLM的强化学习面临着重大的数据稀缺性挑战，包括高质量外部监督有限以及模型生成经验的数量有限。这些局限性使得数据高效的强化学习成为关键的研究方向。本综述首次呈现数据稀缺下LLM强化学习的系统综述。我们提出了一个自下而上的层级框架，围绕三种互补视角构建：以数据为中心的视角、以培训为中心的视角和以框架为中心的视角。我们对现有方法进行分类，总结各类别的代表性方法，并分析其优势与局限性。我们的分类法旨在为理解大型语言模型（LLM）数据高效强化学习设计空间提供清晰的概念基础，并为该新兴领域的研究人员提供指导。我们希望这项调查为未来研究提供全面的路线图，并激励人们朝着更高效、更具可扩展性的大型语言模型训练后强化学习的新方向发展。

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

重新思考序列级强化学习中的比较单元：从损失纠正到样本构建的等长配对训练框架

Authors: Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo wang, Linglin Liao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17328
Pdf link: https://arxiv.org/pdf/2604.17328
Abstract This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a "comparison unit construction" problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO, and RLOO. Through dual-track synchronous generation, prefix inheritance, and segment masking, EqLen efficiently collects effective equal-length training segments and enables stable
中文摘要 本文探讨了序列层相对强化学习中的长度问题。我们观察到，尽管现有方法部分缓解了长度相关现象，但一个更根本的问题仍未充分表征：训练中使用的比较单元缺乏固有的可比性。基于这一观察，我们提出了一个新视角：长度问题不应仅仅视为损失缩放或归一化偏差，而应视为一个“比较单元构造”问题。我们进一步建立了基于样本构建的训练框架，该框架在生成过程中主动构建等长、可比对且可比的训练片段，而不是事后纠正不等长度的响应。在此框架下，我们提出了EqLen，这是一种适用于GRPO、GSPO和RLOO等群体相对比较算法的具体方法。通过双轨同步生成、前缀继承和段遮罩，EqLen 高效收集等长训练段，实现稳定

Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking

通过运动生成和运动追踪学习全身类人运动

Authors: Zewei Zhang, Kehan Wen, Michael Xu, Junzhe He, Chenhao Li, Takahiro Miki, Clemens Schwarke, Chong Zhang, Xue Bin Peng, Marco Hutter
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.17335
Pdf link: https://arxiv.org/pdf/2604.17335
Abstract Whole-body humanoid locomotion is challenging due to high-dimensional control, morphological instability, and the need for real-time adaptation to various terrains using onboard perception. Directly applying reinforcement learning (RL) with reward shaping to humanoid locomotion often leads to lower-body-dominated behaviors, whereas imitation-based RL can learn more coordinated whole-body skills but is typically limited to replaying reference motions without a mechanism to adapt them online from perception for terrain-aware locomotion. To address this gap, we propose a whole-body humanoid locomotion framework that combines skills learned from reference motions with terrain-aware adaptation. We first train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. Concurrently, we train a whole-body reference tracker with RL using this motion data. To improve robustness under imperfectly generated references, we further fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation, and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. The hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations. Quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness.
中文摘要 全身类人机动具有挑战性，原因是高维控制、形态不稳定性以及需要通过机载感知实现对各种地形的实时适应。直接将强化学习（RL）与奖励塑造应用于类人运动，通常会导致下肢主导行为，而基于模仿的强化学习则可以学习更协调的全身技能，但通常只能重复参考动作，缺乏从感知中在线适应的机制，以实现地形感知的运动。为弥补这一空白，我们提出了一个全身类人运动框架，结合参考运动学到的技能与地形感知适应能力。我们首先训练一个关于重新定向人类运动的扩散模型，以实时预测地形感知的参考运动。同时，我们利用这些运动数据训练一个全身参考跟踪器。为了在不完美生成的参考下提升鲁棒性，我们在闭环设置下用冻结的运动发生器进一步微调追踪器。该系统支持方向目标达标控制，具备地形感知的全身适应能力，可部署于具备机载感知和计算功能的Unitree G1人形机器人上。硬件实验展示了成功穿越箱子、障碍、楼梯和混合地形组合的能力。定量结果进一步显示，结合在线动作生成和微调运动追踪器以提升泛化性和稳健性，具有优势。

AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning

AutoSearch：通过强化学习实现高效代理RAG的自适应搜索深度

Authors: Jingbo Sun, Wenyue Chong, Songjun Tu, Qichao Zhang, Yaocheng Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17337
Pdf link: https://arxiv.org/pdf/2604.17337
Abstract Agentic retrieval-augmented generation (RAG) systems enable large language models (LLMs) to solve complex tasks through multi-step interaction with external retrieval tools. However, such multi-step interaction often involves redundant search steps, incurring substantial computational cost and latency. Prior work limits search depth (i.e., the number of search steps) to reduce cost, but this often leads to underexploration of complex questions. To address this, we first investigate how search depth affects accuracy and find a minimal sufficient search depth that defines an accuracy-efficiency trade-off, jointly determined by question complexity and the agent's capability. Furthermore, we propose AutoSearch, a reinforcement learning (RL) framework that evaluates each search step via self-generated intermediate answers. By a self-answering mechanism, AutoSearch identifies the minimal sufficient search depth and promotes efficient search by rewarding its attainment while penalizing over-searching. In addition, reward mechanisms are introduced to stabilize search behavior and improve answer quality on complex questions. Extensive experiments on multiple benchmarks show that AutoSearch achieves a superior accuracy-efficiency trade-off, alleviating over-searching while preserving search quality.
中文摘要 代理检索增强生成（RAG）系统使大型语言模型（LLMs）能够通过多步交互解决复杂任务。然而，这种多步交互通常涉及冗余的搜索步骤，导致计算成本和延迟增加。以往的工作限制了搜索深度（即搜索步骤的数量）以降低成本，但这常常导致对复杂问题的探讨不足。为此，我们首先研究搜索深度如何影响准确性，并找到一个最小的足够搜索深度，以定义由问题复杂度和智能体能力共同决定的准确性与效率权衡。此外，我们提出了AutoSearch，一种强化学习（RL）框架，通过自我生成的中间答案评估每个搜索步骤。通过自答机制，自动搜索识别最小充分搜索深度，并通过奖励其实现而惩罚过度搜索，促进高效搜索。此外，还引入了奖励机制以稳定搜索行为并提升复杂问题的答案质量。对多个基准测试的广泛实验表明，自动搜索在准确性与效率之间的权衡中实现了优越，既缓解了过度搜索，又保持了搜索质量。
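The reward idea in the abstract above (reward reaching the minimal sufficient search depth, penalize searching past it) can be sketched as follows. This is a hedged illustration: the reward shape, the `over_search_penalty` value, and both function names are assumptions, and the per-step correctness flags stand in for the self-generated intermediate answers.

```python
from typing import Optional, Sequence


def minimal_sufficient_depth(step_correct: Sequence[bool]) -> Optional[int]:
    """Return the 1-based index of the first search step whose intermediate
    answer is already correct, or None if no step is sufficient."""
    for depth, is_correct in enumerate(step_correct, start=1):
        if is_correct:
            return depth
    return None


def depth_reward(
    step_correct: Sequence[bool],
    final_correct: bool,
    over_search_penalty: float = 0.1,  # hypothetical per-step penalty
) -> float:
    """Outcome reward minus a penalty for every search step taken after the
    minimal sufficient depth was reached."""
    reward = 1.0 if final_correct else 0.0
    d_min = minimal_sufficient_depth(step_correct)
    if d_min is not None:
        extra_steps = len(step_correct) - d_min
        reward -= over_search_penalty * extra_steps
    return reward
```

For a trajectory whose second of four steps already yields a correct intermediate answer, the two extra steps each cost one penalty unit in this sketch.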

RISC-V Functional Safety for Autonomous Automotive Systems: An Analytical Framework and Research Roadmap for ML-Assisted Certification

RISC-V 自动驾驶汽车系统功能安全：机器学习辅助认证的分析框架与研究路线图

Authors: Nick Andreasyan, Mikhail Struve, Alexey Popov, Maksim Nikolaev, Vadim Vashkelis
Subjects: Software Engineering (cs.SE); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.17391
Pdf link: https://arxiv.org/pdf/2604.17391
Abstract RISC-V is emerging as a viable platform for automotive-grade embedded computing, with recent ISO 26262 ASIL-D certifications demonstrating readiness for safety-critical deployment in autonomous driving systems. However, functional safety in automotive systems is fundamentally a certification problem rather than a processor problem. The dominant costs arise from diagnostic coverage analysis, toolchain qualification, fault injection campaigns, safety-case generation, and compliance with ISO 26262, ISO 21448 (SOTIF), and ISO/SAE 21434. This paper analyzes the role of RISC-V in automotive functional safety, focusing on ISA openness, formal verifiability, custom extension control, debug transparency, and vendor-independent qualification. We examine autonomous driving safety requirements and map them to RISC-V architectural challenges such as lockstep execution, safety islands, mixed-criticality isolation, and secure debug. Rather than proposing a single algorithmic breakthrough, we present an analytical framework and research roadmap centered on certification economics as the primary optimization objective. We also discuss how selected ML methods, including LLM-assisted FMEDA generation, knowledge-graph-based safety case automation, reinforcement learning for fault injection, and graph neural networks for diagnostic coverage, can support certification workflows. We argue that the strongest outcome is not a faster core, but an ASIL-D-ready certifiable RISC-V platform.
中文摘要 RISC-V正逐渐成为汽车级嵌入式计算的可行平台，近期获得的ISO 26262 ASIL-D认证显示其已准备好在自动驾驶系统中执行安全关键部署。然而，汽车系统中的功能安全本质上是认证问题，而非处理器问题。主要成本来自诊断覆盖分析、工具链鉴定、故障注入活动、安全案例生成，以及符合ISO 26262、ISO 21448（SOTIF）和ISO/SAE 21434。本文分析了RISC-V在汽车功能安全中的作用，重点关注ISA开放性、形式化可验证性、自定义扩展控制、调试透明度以及厂商无关资格认证。我们分析自动驾驶安全需求，并将其映射到RISC-V架构挑战，如锁步执行、安全岛、混合关键隔离和安全调试等。我们不提出单一的算法突破，而是提出一个以认证经济学为主要优化目标的分析框架和研究路线图。我们还讨论了部分机器学习方法，包括LLM辅助FMEDA生成、基于知识图谱的安全案例自动化、故障注入的强化学习以及诊断覆盖的图神经网络，如何支持认证工作流程。我们认为，最强的结果不是更快的核心，而是一个ASIL-D支持的认证RISC-V平台。

Think before Go: Hierarchical Reasoning for Image-goal Navigation

三思而后行：图像目标导航的层级推理

Authors: Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Lin Zhao, Long Chen, Zhi-Xin Yang, Nanning Zheng
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.17407
Pdf link: https://arxiv.org/pdf/2604.17407
Abstract Image-goal navigation steers an agent to a target location specified by an image in unseen environments. Existing methods primarily handle this task by learning an end-to-end navigation policy, which compares the similarities of target and observation images and directly predicts the actions. However, when the target is distant or lies in another room, such methods fail to extract informative visual cues, leading the agent to wander around. Motivated by the human cognitive principle that deliberate, high-level reasoning guides fast, reactive execution in complex tasks, we propose Hierarchical Reasoning Navigation (HRNav), a framework that decomposes image-goal navigation into high-level planning and low-level execution. In high-level planning, a vision-language model is trained on a self-collected dataset to generate a short-horizon plan, such as whether the agent should walk through the door or down the hallway. This downgrades the difficulty of the long-horizon task, making it more amenable to the execution part. In low-level execution, an online reinforcement learning policy is utilized to decide actions conditioned on the short-horizon plan. We also devise a novel Wandering Suppression Penalty (WSP) to further reduce the wandering problem. Together, these components form a hierarchical framework for Image-Goal Navigation. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method.
中文摘要 图像目标导航引导代理前往图像指定的目标位置，且在看不见的环境中。现有方法主要通过学习端到端导航策略来完成这一任务，该策略比较目标图像与观测图像的相似性并直接预测其动作。然而，当目标距离较远或位于另一个房间时，这些方法无法提取有用的视觉线索，导致代理在房间里游走。基于人类认知原则，即有意识的高级推理指导复杂任务中的快速、反应性执行，我们提出了层级推理导航（HRNav）框架，将图像目标导航分解为高层规划和低层执行。在高层次规划中，视觉语言模型会在自收集的数据集上训练，生成短期规划，比如代理人是走进门还是走过走廊。这降低了长期任务的难度，使其更适合执行。在低层执行中，在线强化学习策略被用来决定基于短期计划的行动。我们还设计了一种新的“流浪抑制惩罚”（WSP），以进一步减少流浪问题。这些组件共同构成了图像-目标导航的层级框架。在模拟和现实环境中的广泛实验证明了我们方法的优越性。

TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling

TrafficClaw：通过统一物理环境建模实现的可通用城市交通控制

Authors: Siqi Lai, Pan Zhang, Yuping Zhou, Jindong Han, Yansong Ning, Hao Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17456
Pdf link: https://arxiv.org/pdf/2604.17456
Abstract Urban traffic control is a system-level coordination problem spanning heterogeneous subsystems, including traffic signals, freeways, public transit, and taxi services. Existing optimization-based, reinforcement learning (RL), and emerging LLM-based approaches are largely designed for isolated tasks, limiting both cross-task generalization and the ability to capture coupled physical dynamics across subsystems. We argue that effective system-level control requires a unified physical environment in which subsystems share infrastructure, mobility demand, and spatiotemporal constraints, allowing local interventions to propagate through the network. To this end, we propose TrafficClaw, a framework for general urban traffic control built upon a unified runtime environment. TrafficClaw integrates heterogeneous subsystems into a shared dynamical system, enabling explicit modeling of cross-subsystem interactions and closed-loop agent-environment feedback. Within this environment, we develop an LLM agent with executable spatiotemporal reasoning and reusable procedural memory, supporting unified diagnostics across subsystems and continual strategy refinement. Furthermore, we introduce a multi-stage training pipeline with supervised initialization and agentic RL with system-level optimization, further enabling coordinated and system-aware performance. Experiments demonstrate that TrafficClaw achieves robust, transferable, and system-aware performance across unseen traffic scenarios, dynamics, and task configurations. Our project is available at this https URL.
中文摘要 城市交通控制是一种系统层面的协调问题，涉及多个异构子系统，包括交通信号、高速公路、公共交通和出租车服务。现有的基于优化的强化学习（RL）和新兴的基于大型语言模型的方法大多为孤立任务设计，限制了跨任务的泛化以及跨子系统捕捉耦合物理动态的能力。我们认为，有效的系统级控制需要一个统一的物理环境，其中子系统共享基础设施、移动需求和时空约束，使局部干预能够在网络中传播。为此，我们提出了TrafficClaw，一个基于统一运行环境的通用城市交通控制框架。TrafficClaw 将异构子系统集成到共享的动态系统中，实现跨子系统交互的显式建模和闭环代理-环境反馈。在此环境中，我们开发了具备可执行时空推理和可重复使用程序记忆的LLM代理，支持跨子系统统一诊断和持续策略优化。此外，我们引入了多阶段训练流水线，配备监督初始化和智能强化学习，并实现系统级优化，进一步实现协调和系统感知性能。实验表明，TrafficClaw在未见的交通场景、动态和任务配置中实现了稳健、可迁移且具系统感知的性能。我们的项目可在此 https 网址访问。

Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception

盲目醒来：面向扎根视觉感知的无监督智能体轨迹冷启动优化

Authors: Ashutosh Bajpai, Tamal Majumder, Akshay Nambi, Tanmoy Chakraborty
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.17475
Pdf link: https://arxiv.org/pdf/2604.17475
Abstract Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Coldstart Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling agents to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric to quantify tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA boosts agentic trajectories, improving task accuracy by up to 5% and tool efficiency by 9%, enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
中文摘要 小型视觉语言模型（SVLMs）是高效的任务控制器，但常常存在视觉脆弱性和工具编排不佳的问题。它们通常需要昂贵且有监督的轨迹调校来缓解这些缺陷。本研究提出由级联工具展开对齐实现的自监督感知（SPECTRA），这是一个无监督框架，通过冷启动强化学习为SVLMs启动智能体能力。SPECTRA强制软结构化多回合滚动，这是一种拓扑约束，指示代理在综合前明确排序工具衍生证据，有效地将推理基于视觉观察。我们采用多目标奖励信号，同时最大化任务正确性、展开结构和工具实用性，使代理能够在不依赖人类偏好标签的情况下自我发现稳健行为。我们还进一步介绍了工具工具效用（TIU），这是一种在缺乏实地信息的情况下量化工具效能的新指标。综合和非分发（MMMU-Pro）基准测试的广泛评估表明，SPECTRA提升了智能体轨迹，任务准确率提升了最多5%，工具效率提升了9%，使多模态智能体能够通过环境相互作用有效学习。

RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding

RS-HyRe-R1：一种混合奖励机制，用于克服感知惯性以实现遥感图像理解

Authors: Gaozhi Zhou, Hu He, Peng Shen, Jipeng Zhang, Liujue Zhang, Linrui Xu, Zeyuan Wang, Ziyu Li, Xuezhi Cui, Wang Guo, Haifeng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17504
Pdf link: https://arxiv.org/pdf/2604.17504
Abstract Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at this https URL.
中文摘要 强化学习（RL）的后训练显著提升了遥感视觉语言模型（RS-VLM）。然而，在处理需要详尽视觉扫描的复杂遥感图像（RSI）时，模型倾向于依赖局部显著线索进行快速推断。我们称这种强化语言引起的偏向为“感知惯性”。受奖励最大化驱动，模型倾向于快速拟合结果，导致两个局限：认知上，过度依赖特定特征阻碍了完整的证据构建;在操作层面，模型难以灵活地在不同任务之间转移视觉焦点。为解决这一偏见并鼓励全面的视觉证据挖掘，我们提出了RS-HyRe-R1，一种用于理解RSI的混合奖励框架。它引入了：（1）空间推理激活奖励，强化结构化的视觉推理;（2）感知正确性奖励，在RS任务中提供自适应质量锚点，确保几何和语义的准确对齐;以及（3）视觉语义路径进化奖励，惩罚重复推理，促进对互补线索的探索，以构建更丰富的证据链。实验显示，RS-HyRe-R1有效缓解了“感知惯性”，鼓励更深层次、更多样化的推理。仅用3B参数，它在REC、OVD和VQA任务中实现了最先进的性能，超过了7B参数下的模型。它还展现了强烈的零发推广能力，分别在VQA、OVD和REC上比第二好的模型高出3.16%、3.97%和2.72%。代码和数据集可在该 https URL 访问。

PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs

PoliLegalLM：关于政治与法律事务大型语言模型的技术报告

Authors: Yuting Huang, Yinghao Hu, Qian Xiao, Wenlin Zhong, Yiquan Wu, Taishi Zhou, Moke Chen, Changlong Sun, Kun Kuang, Fei Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.17543
Pdf link: https://arxiv.org/pdf/2604.17543
Abstract Large language models (LLMs) have achieved remarkable success in general-domain tasks, yet their direct application to the legal domain remains challenging due to hallucinated legal citations, incomplete knowledge coverage, and weak structured reasoning. To address these issues, we propose PoliLegalLM, a domain-specific large language model tailored for political and legal applications. Our approach adopts a unified training framework that integrates continued pretraining, progressive supervised fine-tuning, and preference-based reinforcement learning to jointly enhance legal knowledge grounding, task alignment, and reasoning capability. We construct a large-scale, high-quality legal corpus and design a structured post-training pipeline, enabling the model to effectively learn domain-specific knowledge and adapt to diverse legal tasks. We evaluate PoliLegalLM on three representative benchmarks, including LawBench, LexEval, and a real-world dataset, PoliLegal. Experimental results demonstrate that PoliLegalLM achieves strong and consistent performance, outperforming competitive models of similar scale and remaining highly competitive with significantly larger models, while achieving the best results on real-world legal scenarios. These results highlight the effectiveness of our training paradigm and the practical value of domain-specific LLMs for real-world legal applications.
中文摘要 大型语言模型（LLMs）在通用领域任务中取得了显著成功，但由于法律引用多出错、知识覆盖不完整以及结构推理薄弱，其直接应用于法律领域仍具挑战性。为解决这些问题，我们提出了PoliLegalLM，一个专门用于政治和法律应用的领域专用大型语言模型。我们的方法采用统一的培训框架，整合持续的预培训、渐进式监督微调和基于偏好的强化学习，共同提升法律知识的扎实、任务对齐和推理能力。我们构建大规模、高质量的法律语料库，并设计结构化的培训后流程，使模型能够有效学习领域特定知识并适应多样化的法律任务。我们基于三个代表性基准测试PoliLegalLM，包括LawBench、LexEval和一个真实世界数据集PoliLegal。实验结果表明，PoliLegalLM实现了强大且稳定的性能，优于同等规模的竞争模型，并且在面对更大模型时依然保持高度竞争力，同时在真实法律场景中取得最佳成绩。这些结果凸显了我们训练范式的有效性以及领域特定大型语言模型在现实法律应用中的实际价值。

SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

SVL：目标条件强化学习作为生存学习

Authors: Franki Nguimatsia Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17551
Pdf link: https://arxiv.org/pdf/2604.17551
Abstract Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time-to-goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed-form identity that expresses the goal-conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right-censored trajectories. We introduce three practical value estimators, including finite-horizon truncation and two binned infinite-horizon approximations to capture long-horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long-horizon tasks.
中文摘要 依赖时间差分学习的标准目标条件强化学习（GCRL）方法由于自助法（bootstrapping）可能不稳定且样本效率低下。虽然近期研究探索了对比和监督式表述以提升稳定性，但我们提出了一种概率替代方案，称为生存价值学习（SVL），通过将每个状态的目标到达时间建模为概率分布，将GCRL重新框定为生存学习问题。这种结构化分布的蒙特卡洛视角给出了一个封闭形式恒等式，将目标条件值函数表达为存活概率的折现和，从而通过通过对事件和右遮蔽轨迹进行最大似然训练的危险模型进行价值估计。我们引入了三种实用的值估计方法，包括有限视界截断和两种分箱无限视界近似，以捕捉长视野目标。离线GCRL基准测试显示，SVL结合分层行为者可匹配甚至超越强分层TD和蒙特卡洛基线，在复杂且长视野的任务中表现出色。
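The abstract above states that the goal-conditioned value can be written as a discounted sum of survival probabilities. One standard identity of this kind, given here as an illustration and not necessarily the paper's exact form, assumes a sparse reward so that the value is E[gamma^T] for a discrete time-to-goal T. With S(t) = P(T > t), it reads V = 1 - (1 - gamma) * sum_t gamma^t * S(t), and S(t) is the running product of (1 - h(k)) over a per-step hazard h(k) = P(T = k | T >= k). The function name and the truncation handling below are assumptions.

```python
from typing import Sequence


def value_from_hazards(hazards: Sequence[float], gamma: float) -> float:
    """Finite-horizon estimate of E[gamma^T] from per-step hazards.

    Uses V = 1 - (1 - gamma) * sum_t gamma^t * S(t), where
    S(t) = prod_{k <= t} (1 - hazards[k]). Terms beyond len(hazards) are
    dropped, so the horizon should be long enough for the tail to be small.
    """
    survival = 1.0             # running product, becomes S(t) inside the loop
    discounted_survival = 0.0  # accumulates sum_t gamma^t * S(t)
    discount = 1.0             # gamma^t
    for hazard in hazards:
        survival *= 1.0 - hazard
        discounted_survival += discount * survival
        discount *= gamma
    return 1.0 - (1.0 - gamma) * discounted_survival
```

With a constant hazard h the time-to-goal is geometric and the closed form is h / (1 - gamma * (1 - h)), which gives a quick sanity check for the sketch.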

COSEARCH: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search

COSEARCH：通过强化学习进行推理与文档排序的联合训练，用于代理搜索

Authors: Hansi Zeng, Liam Collins, Bhuvesh Kumar, Neil Shah, Hamed Zamani
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.17555
Pdf link: https://arxiv.org/pdf/2604.17555
Abstract Agentic search -- the task of training agents that iteratively reason, issue queries, and synthesize retrieved information to answer complex questions -- has achieved remarkable progress through reinforcement learning (RL). However, existing approaches, such as Search-R1, treat the retrieval system as a fixed tool, optimizing only the reasoning agent while the retrieval component remains unchanged. A preliminary experiment reveals that the gap between an oracle and a fixed retrieval system reaches up to +26.8% relative F1 improvement across seven QA benchmarks, suggesting that the retrieval system is a key bottleneck in scaling agentic search performance. Motivated by this finding, we propose CoSearch, a framework that jointly trains a multi-step reasoning agent and a generative document ranking model via Group Relative Policy Optimization (GRPO). To enable effective GRPO training for the ranker -- whose inputs vary across reasoning trajectories -- we introduce a semantic grouping strategy that clusters sub-queries by token-level similarity, forming valid optimization groups without additional rollouts. We further design a composite reward combining ranking quality signals with trajectory-level outcome feedback, providing the ranker with both immediate and long-term learning signals. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate consistent improvements over strong baselines, with ablation studies validating each design choice. Our results show that joint training of the reasoning agent and retrieval system is both feasible and strongly performant, pointing to a key ingredient for future search agents.
中文摘要 智能搜索——训练代理通过迭代推理、发出查询并综合检索到的信息来回答复杂问题的任务——通过强化学习（RL）取得了显著进展。然而，现有方法如Search-R1将检索系统视为固定工具，仅优化推理代理，检索部分保持不变。初步实验显示，oracle与固定检索系统之间的差距在七个质量保证基准中可达到+26.8%的相对F1提升，表明检索系统是调整代理搜索性能的关键瓶颈。基于这一发现，我们提出了CoSearch框架，该框架通过群体相对策略优化（GRPO）联合训练多步推理代理和生成文档排名模型。为了为排名者提供有效的GRPO训练——其输入在不同推理轨迹中变化——我们引入了一种语义分组策略，按令牌级相似度将子查询聚类，形成有效的优化组，无需额外展开。我们进一步设计了一种综合奖励，将排名质量信号与轨迹级结果反馈结合起来，为排名者提供即时和长期的学习信号。七个单跳和多跳质量保证基准测试的实验显示，优于强基线，消融研究验证了每一项设计选择。我们的结果表明，推理智能体与检索系统的联合训练既可行又性能强，这为未来搜索智能体提供了关键要素。
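The semantic grouping step in the abstract above (cluster sub-queries by token-level similarity, then compute group-relative advantages inside each cluster) might look roughly like the sketch below. The Jaccard similarity on whitespace tokens, the greedy clustering, the `threshold` value, and all function names are assumptions made for illustration; only the mean/standard-deviation normalization follows the usual GRPO-style advantage.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity on lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def group_by_similarity(queries: Sequence[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy clustering: a query joins the first group whose first member
    is similar enough, otherwise it starts a new group."""
    groups: List[List[int]] = []
    for index, query in enumerate(queries):
        for group in groups:
            if jaccard(query, queries[group[0]]) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


def group_relative_advantages(
    rewards: Sequence[float], groups: Sequence[Sequence[int]], eps: float = 1e-6
) -> List[float]:
    """Normalize each reward by the mean and population std of its group."""
    advantages = [0.0] * len(rewards)
    for group in groups:
        group_rewards = [rewards[i] for i in group]
        mu, sigma = mean(group_rewards), pstdev(group_rewards)
        for i in group:
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages
```

A singleton group always receives a zero advantage here, which is one reason a grouping strategy has to produce groups with more than one member to give the ranker a learning signal.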

Poly-EPO: Training Exploratory Reasoning Models

多元EPO：探索性推理模型的训练

Authors: Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17654
Pdf link: https://arxiv.org/pdf/2604.17654
Abstract Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
中文摘要 探索是从经验中学习的基石：它使智能体能够找到复杂问题的解决方案，泛化到新问题，并随测试时计算量的增加而提升性能。本文提出了一个语言模型（LM）后训练框架，明确鼓励乐观探索，并促进探索与利用之间的协同。核心思想是训练LM生成一组回答，使其在奖励函数下整体准确，并在推理策略上具有探索性。我们首先给出了在任意目标函数下用集合强化学习（set RL）优化LM的通用方案，展示了如何通过修改优势计算使标准强化学习算法适配这一设定。随后，我们提出了多色探索性策略优化（Poly-EPO），它以一个明确协同探索与利用的目标函数来实例化该框架。在多种推理基准测试中，我们证明Poly-EPO提升了泛化性（体现为更高的pass@$k$覆盖率），保持了模型生成结果的更大多样性，并能随测试时计算有效扩展。
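
The abstract reports gains in pass@$k$ coverage. As a minimal sketch of how that metric is usually computed, the function below implements the standard unbiased pass@$k$ estimator from n sampled responses of which c are correct. Whether Poly-EPO uses exactly this estimator is an assumption (the abstract does not say), and the name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct response.
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 16 samples per problem, 2 of them correct.
    for k in (1, 4, 8):
        print(k, pass_at_k(n=16, c=2, k=k))
```

Averaging this quantity over problems gives the benchmark-level score; a policy that keeps diverse reasoning strategies alive tends to show its advantage at larger k.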

OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

OmniVLA-RL：具备空间理解与在线强化学习的视觉-语言-行动模型

Authors: Haoxiang Jie, Yaoyuan Yan, Xiangyu Wei, Kailin Wang, Hongjie Yan, Zhiyou Heng, Daocheng Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.17706
Pdf link: https://arxiv.org/pdf/2604.17706
Abstract Visual-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and instability in reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that leverages a Mix-of-Transformers (MoT) design to synergistically integrate reasoning, spatial, and action experts. Furthermore, we introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and integrates it with Group Segmented Policy Optimization (GSPO) to enhance action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks demonstrate that OmniVLA-RL significantly outperforms state-of-the-art methods, effectively overcoming the fundamental limitations of current VLA models.
中文摘要 视觉-语言-行动（VLA）模型代表了具身人工智能的范式转变，但现有框架常常面临空间感知不精确、多模态融合不优以及强化学习不稳定性的问题。为弥合这些空白，我们提出了OmniVLA-RL，一种利用混合变换器（MoT）设计协同整合推理、空间和行动专家的新型架构。此外，我们引入了Flow-GSPO，将流匹配重新表述为随机微分方程（SDE）过程，并与群组分段策略优化（GSPO）集成，以提升动作精度和训练鲁棒性。对LIBERO和LIBERO-Plus基准测试的广泛评估表明，OmniVLA-RL的表现显著优于最先进方法，有效克服了当前VLA模型的根本局限。

Tool Learning Needs Nothing More Than a Free 8B Language Model

工具学习只需要一个免费的8B语言模型

Authors: Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Junqiang Zheng, Saiyong Yang, Yunfang Wu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.17739
Pdf link: https://arxiv.org/pdf/2604.17739
Abstract Reinforcement learning (RL) has become a prevalent paradigm for training tool-calling agents, which typically requires online interactive environments. Existing approaches either rely on training data with ground-truth annotations or require advanced commercial language models (LMs) to synthesize environments that remain fixed once created. In this work, we propose TRUSTEE, a data-free method for training tool-calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool simulation and trajectory evaluation, paired with an adaptive curriculum learning mechanism that controls various aspects of the task difficulty dynamically during training. Our empirical results show that TRUSTEE brings consistent improvements across various domains and outperforms all the baselines, which require extra external resources for training. These results confirm that, with a sufficiently sophisticated design, even simulated environments with a local 8B LM as the backbone could set a strong baseline for tool learning, without expensive annotated data, realistic human interactions, executable tools or costly verifiable environments from human experts or commercial LMs. We hope our proposed paradigm could inspire future research on environment scaling with limited resources.
中文摘要 强化学习（RL）已成为训练工具调用智能体的常用范式，而这通常需要在线交互环境。现有方法要么依赖带有真值标注的训练数据，要么需要先进的商业语言模型（LM）来合成环境，且环境一旦创建就保持固定。在本研究中，我们提出了TRUSTEE，一种无需数据的工具调用智能体训练方法：其动态环境完全由免费的开源LM（最小可至8B）模拟，包括任务生成、用户模拟、工具模拟和轨迹评估，并配合自适应课程学习机制，在训练过程中动态控制任务难度的各个方面。我们的实证结果显示，TRUSTEE在多个领域带来一致的改进，并且超越了所有需要额外外部资源进行训练的基线。这些结果证实，只要设计足够精巧，即使是以本地8B LM为骨干的模拟环境，也能为工具学习建立一个强基线，而无需昂贵的标注数据、真实的人类交互、可执行工具，或来自人类专家或商业LM的昂贵可验证环境。我们希望所提出的范式能启发未来在有限资源下进行环境扩展的研究。

Input-Side Variance Suppression under Non-Normal Transient Amplification in Continuous-Control Reinforcement Learning

连续控制强化学习中非正规瞬态放大下的输入端方差抑制

Authors: Wu Yue
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.17744
Pdf link: https://arxiv.org/pdf/2604.17744
Abstract Continuous-control reinforcement learning (RL) often exhibits large closed-loop variance, high-frequency control jitter, and sensitivity to disturbance injection. Existing explanations usually emphasize disturbance sources such as action noise, exploration perturbations, or policy nonsmoothness. This letter studies a complementary amplifier-side perspective: in nominally stable yet strongly non-normal closed loops, small input perturbations can undergo transient amplification and lead to disproportionately large state covariance. Motivated by this source--amplifier decomposition, we introduce an input-side variance suppression layer that operates between the learned policy and the plant input to reduce applied-input variance and step-to-step jitter. To separate mechanism from correlation, we use two control-theoretic interventions: one varies only eigenvector geometry under fixed eigenvalues and spectral radius, and the other varies only applied-input statistics under fixed strongly non-normal geometry. We then provide mechanism-consistent external validation on planar quadrotor tasks. Throughout, Koopman/ALE surrogates are used only as analysis and certification tools, not as direct performance paths. Taken together, the results support a narrower claim: in the studied settings, non-normal transient amplification is an important and under-emphasized contributor to execution-time closed-loop variance, and source-side suppression can reduce downstream covariance without changing the structural peak gain.
中文摘要 连续控制强化学习（RL）通常表现出较大的闭环方差、高频控制抖动以及对扰动注入的敏感性。现有解释通常强调扰动源，如动作噪声、探索扰动或策略的非平滑性。本文（快报）研究了一个互补的放大器侧视角：在名义稳定但强非正规的闭环中，小的输入扰动可能经历瞬态放大，导致不成比例的大状态协方差。基于这一“源-放大器”分解，我们引入了一个输入端方差抑制层，它工作在学习到的策略与被控对象输入之间，以减小实际施加输入的方差和步间抖动。为了区分机制与相关性，我们采用两种控制理论干预：一种在特征值和谱半径固定的条件下仅改变特征向量几何，另一种在强非正规几何固定的条件下仅改变实际施加输入的统计量。随后，我们在平面四旋翼任务上提供了与该机制一致的外部验证。在整个过程中，Koopman/ALE代理模型仅作为分析和认证工具使用，而非直接的性能路径。综合来看，这些结果支持一个更窄的主张：在所研究的设定中，非正规瞬态放大是执行时闭环方差的一个重要且未被充分重视的来源，而源端抑制可以在不改变结构峰值增益的情况下减小下游协方差。
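
To make "transient amplification in a stable but non-normal loop" concrete, here is a small numerical sketch. It compares two 2x2 closed-loop matrices with identical eigenvalues (0.9 and 0.8) that differ only in an off-diagonal coupling, which is what makes the eigenvectors non-orthogonal. The matrices and the helper names are illustrative assumptions, not systems taken from the paper.

```python
import math


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0],
         a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0],
         a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]


def frobenius(a):
    """Frobenius norm of a 2x2 matrix."""
    return math.sqrt(sum(x * x for row in a for x in row))


def transient_gains(a, steps):
    """Return [||A^k||_F for k = 1..steps]."""
    gains, power = [], [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        power = matmul2(power, a)
        gains.append(frobenius(power))
    return gains


# Both matrices are triangular, so their eigenvalues are the diagonal
# entries 0.9 and 0.8: same spectrum, same spectral radius.
NORMAL = [[0.9, 0.0], [0.0, 0.8]]
NON_NORMAL = [[0.9, 10.0], [0.0, 0.8]]

if __name__ == "__main__":
    for name, mat in (("normal", NORMAL), ("non-normal", NON_NORMAL)):
        gains = transient_gains(mat, 200)
        print(name, "peak gain:", max(gains), "final gain:", gains[-1])
```

Both systems decay to zero eventually, but the non-normal one first amplifies perturbations by more than an order of magnitude, which is the effect an input-side suppression layer would be trying to starve of input variance.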

Efficient Federated RLHF via Zeroth-Order Policy Optimization

通过零阶策略优化实现高效的联邦RLHF

Authors: Deyi Wang, Qining Zhang, Lei Ying
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.17747
Pdf link: https://arxiv.org/pdf/2604.17747
Abstract This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^2$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^2$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF on four MuJoCo RL tasks.
中文摘要 本文研究在智能体资源受限（如边缘设备）的联邦学习场景下基于人类反馈的强化学习。我们提出了一种高效的联邦RLHF算法，名为分区的、基于符号的随机零阶策略优化（Par-S$^2$ZPO）。该算法建立在采用二值扰动的零阶优化之上，在设计上即具有较低的通信、计算和内存复杂度。我们的理论分析给出了Par-S$^2$ZPO收敛速率的上界，表明其在样本复杂度方面与中心化版本同样高效，而在策略更新迭代次数方面收敛更快。实验结果显示，它在四个MuJoCo强化学习任务上优于基于FedAvg的RLHF。
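
The following is a generic sketch of a sign-based zeroth-order update with binary (plus/minus one) perturbations, the family of methods the abstract says Par-S$^2$ZPO builds on. It is not the paper's algorithm: the partitioning, the federated aggregation, and the real human-feedback oracle are all omitted, and `toy_return` and `toy_preference` are hypothetical stand-ins for rollouts and preference labels.

```python
import random


def sign_zo_step(theta, prefer, mu, lr, rng):
    """One zeroth-order step driven by a single binary comparison."""
    # Binary (Rademacher) perturbation direction.
    v = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + mu * vi for t, vi in zip(theta, v)]
    minus = [t - mu * vi for t, vi in zip(theta, v)]
    # Only the sign of the comparison is used, never a gradient.
    s = 1.0 if prefer(plus, minus) else -1.0
    return [t + lr * s * vi for t, vi in zip(theta, v)]


def toy_return(theta):
    """Hypothetical return: highest at theta = 0."""
    return -sum(t * t for t in theta)


def toy_preference(a, b):
    """Hypothetical feedback oracle: True if policy a beats policy b."""
    return toy_return(a) > toy_return(b)


def run(steps=200, seed=0):
    rng = random.Random(seed)
    theta = [1.0] * 5
    for _ in range(steps):
        theta = sign_zo_step(theta, toy_preference, mu=0.1, lr=0.05, rng=rng)
    return theta


if __name__ == "__main__":
    print("initial return:", toy_return([1.0] * 5))
    print("final return:", toy_return(run()))
```

Each step needs only forward evaluations and one bit of feedback, and the perturbation can be regenerated from a shared seed, which is why schemes of this kind are attractive for communication- and memory-limited clients.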

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

反向宪法人工智能（Reverse Constitutional AI）：基于概率钳制RLAIF的可控有毒数据生成框架

Authors: Yuan Fang, Yiming Luo, Aimin Zhou, Fei Tan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17769
Pdf link: https://arxiv.org/pdf/2604.17769
Abstract Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique--revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
中文摘要 确保大型语言模型（LLM）的安全性需要强有力的红队测试，但高质量有毒数据的系统性合成仍未得到充分探索。我们提出了反向宪法人工智能（R-CAI），这是一个自动化且可控的对抗性数据生成框架，不再局限于孤立的越狱提示。通过将无害性宪法反转为毒性宪法，并借助批判-修订流程迭代优化模型输出，R-CAI无需人工标注即可规模化合成多维度的对抗性数据。然而，仅针对毒性相关奖励进行优化，可能导致奖励欺骗（reward hacking）和语义连贯性下降。为应对这一挑战，我们在基于AI反馈的强化学习中引入了概率钳制，它在保持对抗意图的同时稳定了对抗性优化。实验表明，R-CAI能够生成多样且高质量的有毒数据，且概率钳制在不牺牲对抗强度的前提下显著提升了语义连贯性（15%）。总体而言，R-CAI为红队数据生成和已对齐语言模型的系统性安全评估提供了全自动化框架。

Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

Re$^2$MoGen：通过大型语言模型推理和物理感知精炼实现的开放词汇运动生成

Authors: Jiakun Zheng, Ting Xiao, Shiqin Cao, Xinran Li, Zhe Wang, Chenjia Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.17807
Pdf link: https://arxiv.org/pdf/2604.17807
Abstract Text-to-motion (T2M) generation aims to control the behavior of a target character via textual descriptions. Leveraging text-motion paired datasets, existing T2M models have achieved impressive performance in generating high-quality motions within the distribution of their training data. However, their performance deteriorates notably when the motion descriptions differ significantly from the training texts. To address this issue, we propose Re$^2$MoGen, a Reasoning and Refinement open-vocabulary Motion Generation framework that leverages enhanced Large Language Model (LLM) reasoning to generate an initial motion planning and then refine its physical plausibility via reinforcement learning (RL) post-training. Specifically, Re$^2$MoGen consists of three stages: We first employ Monte Carlo tree search to enhance the LLM's reasoning ability in generating reasonable keyframes of the motion based on text prompts, specifying only the root and several key joints' positions to ease the reasoning process. Then, we apply a human pose model as a prior to optimize the full-body poses based on the planned keyframes and use the resulting incomplete motion to supervise fine-tuning a pre-trained motion generator via a dynamic temporal matching objective, enabling spatiotemporal completion. Finally, we use post-training with physics-aware reward to refine motion quality to eliminate physical implausibility in LLM-planned motions. Extensive experiments demonstrate that our framework can generate semantically consistent and physically plausible motions and achieve state-of-the-art performance in open-vocabulary motion generation.
中文摘要 文本到运动（T2M）生成旨在通过文本描述控制目标角色的行为。借助文本-运动配对数据集，现有T2M模型在其训练数据分布内生成高质量运动方面取得了出色表现。然而，当运动描述与训练文本差异显著时，其性能会明显下降。为解决这一问题，我们提出了Re$^2$MoGen，一个推理与精炼（Reasoning and Refinement）的开放词汇运动生成框架，它利用增强的大型语言模型（LLM）推理生成初始运动规划，再通过强化学习（RL）后训练改进其物理合理性。具体来说，Re$^2$MoGen包含三个阶段：首先，我们采用蒙特卡洛树搜索增强LLM根据文本提示生成合理运动关键帧的推理能力，且只指定根节点和若干关键关节的位置，以简化推理过程。然后，我们以人体姿态模型为先验，基于规划出的关键帧优化全身姿态，并用所得的不完整运动，通过动态时间匹配目标监督预训练运动生成器的微调，从而实现时空补全。最后，我们使用带有物理感知奖励的后训练来精炼运动质量，消除LLM规划运动中的物理不合理之处。大量实验表明，我们的框架能够生成语义一致且物理合理的运动，并在开放词汇运动生成上达到最先进的性能。

DART: Learning-Enhanced Model Predictive Control for Dual-Arm Non-Prehensile Manipulation

DART：双臂非抓握操作的学习增强模型预测控制

Authors: Autrio Das, Shreya Bollimuntha, Madala Venkata Renu Jeevesh, Keshab Patra, Tashmoy Gosh, Nagamanikandan G, Arun Kumar, Madhava Krishna
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.17833
Pdf link: https://arxiv.org/pdf/2604.17833
Abstract What appears effortless to a human waiter remains a major challenge for robots. Manipulating objects nonprehensilely on a tray is inherently difficult, and the complexity is amplified in dual-arm settings. Such tasks are highly relevant to service robotics in domains such as hotels and hospitality, where robots must transport and reposition diverse objects with precision. We present DART, a novel dual-arm framework that integrates nonlinear Model Predictive Control (MPC) with an optimization-based impedance controller to achieve accurate object motion relative to a dynamically controlled tray. The framework systematically evaluates three complementary strategies for modeling tray-object dynamics as the state transition function within our MPC formulation: (i) a physics-based analytical model, (ii) an online regression based identification model that adapts in real-time, and (iii) a reinforcement learning-based dynamics model that generalizes across object properties. Our pipeline is validated in simulation with objects of varying mass, geometry, and friction coefficients. Extensive evaluations highlight the trade-offs among the three modeling strategies in terms of settling time, steady-state error, control effort, and generalization across objects. To the best of our knowledge, DART constitutes the first framework for non-prehensile dual-arm manipulation of objects on a tray. Project Link: this https URL
中文摘要 对人类服务员来说看似轻松的事情，对机器人来说依然是一大挑战。在托盘上以非抓握方式操作物体本身就很困难，而在双臂设定下复杂性进一步放大。这类任务与酒店和餐旅服务等领域的服务机器人高度相关，机器人必须精确地运送并重新摆放各种物体。我们提出DART，一种新型双臂框架，将非线性模型预测控制（MPC）与基于优化的阻抗控制器相结合，以实现物体相对于动态受控托盘的精确运动。该框架系统地评估了三种互补策略，用于将托盘-物体动力学建模为MPC公式中的状态转移函数：（i）基于物理的解析模型，（ii）可实时自适应的基于在线回归的辨识模型，以及（iii）可跨物体属性泛化的基于强化学习的动力学模型。我们的流程在仿真中用质量、几何形状和摩擦系数各不相同的物体进行了验证。大量评估揭示了三种建模策略在调节时间、稳态误差、控制量以及跨物体泛化方面的权衡。据我们所知，DART是首个用于对托盘上物体进行非抓握双臂操作的框架。项目链接：此 https URL

LEPO: Latent Reasoning Policy Optimization for Large Language Models

LEPO：面向大型语言模型的潜在推理策略优化

Authors: Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, Qiang Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17892
Pdf link: https://arxiv.org/pdf/2604.17892
Abstract Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
中文摘要 近来，潜在推理被引入大型语言模型（LLM），以利用连续空间中的丰富信息。然而，由于缺少随机采样，这些方法不可避免地退化为确定性推理，无法发现多样的推理路径。为弥合这一差距，我们通过Gumbel-Softmax向潜在推理注入可控的随机性，恢复LLM的探索能力，并增强其与强化学习（RL）的兼容性。在此基础上，我们提出了潜在推理策略优化（Latent Reasoning Policy Optimization，LEPO），这是一个直接将强化学习应用于连续潜在表示的新框架。具体来说，在采样（rollout）阶段，LEPO保持随机性以实现多样化的轨迹采样；在优化阶段，LEPO为潜在表示和离散词元构建统一的梯度估计。大量实验表明，LEPO显著优于现有的面向离散推理和潜在推理的强化学习方法。
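
As a rough illustration of injecting stochasticity into a latent step with Gumbel-Softmax, the sketch below draws a relaxed sample over a toy vocabulary and mixes token embeddings with the resulting weights. How LEPO wires this into the model is not described in the abstract, so the mixing step, the temperature value, and the names `gumbel_softmax` and `mix_embeddings` are assumptions for illustration only.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed sample: a probability vector over the logits."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both log() calls stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, tempered logits.
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_embeddings(weights, embeddings):
    """Weighted sum of embedding rows: a continuous 'latent token'."""
    dim = len(embeddings[0])
    return [sum(w * row[j] for w, row in zip(weights, embeddings))
            for j in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, -1.0]                      # toy 3-token vocabulary
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 2-d embeddings
    for _ in range(3):
        weights = gumbel_softmax(logits, tau=0.7, rng=rng)
        print(weights, mix_embeddings(weights, table))
```

Lower temperatures push the sample toward a one-hot choice and higher ones toward the plain softmax mixture, so `tau` acts as the knob for how much exploration the latent rollout retains.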

Fisher Decorator: Refining Flow Policy via A Local Transport Map

Fisher Decorator：通过局部传输映射精炼流策略

Authors: Xiaoyuan Cheng, Haoyu Wang, Wenxuan Yuan, Ziyan Wang, Zonghao Chen, Li Zeng, Zhuo Sun
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.17919
Pdf link: https://arxiv.org/pdf/2604.17919
Abstract Recent advances in flow-based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade-offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the $L_2$ regularization as an upper bound of the 2-Wasserstein distance ($W_2$), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the $L_2$ (or upper bound of $W_2$) regularization is isotropic and density-insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state-of-the-art performance across diverse offline RL benchmarks. See project page: this https URL.
中文摘要 基于流的离线强化学习（RL）的最新进展通过流匹配对策略进行参数化，取得了强劲的性能。然而，它们仍然面临表达能力、最优性和效率之间的关键权衡。特别地，现有的流策略将$L_2$正则化解释为2-Wasserstein距离（$W_2$）的上界，这在离线设定中可能存在问题。该问题源于一个根本的几何不匹配：行为策略流形本质上是各向异性的，而$L_2$（或$W_2$的上界）正则化是各向同性的且对密度不敏感，导致优化方向系统性错位。为此，我们从几何视角重新审视离线强化学习，并证明策略精炼可以表述为一个局部传输映射：在初始流策略上叠加一个残差位移。通过分析由此诱导的密度变换，我们推导出由Fisher信息矩阵决定的KL约束目标的局部二次近似，从而得到可处理的各向异性优化形式。利用嵌入在流速度场中的得分函数，我们得到了相应的二次约束以实现高效优化。我们的结果表明，以往方法中的最优性差距源于其各向同性近似。相比之下，我们的框架在最优解的可证明邻域内实现了可控的近似误差。大量实验表明，该方法在多种离线强化学习基准上达到了最先进的性能。项目页面见：此 https URL。
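
For readers who want the shape of "a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix", the textbook parametric version is written out below. This is the standard second-order expansion for a smoothly parameterized policy $\pi_\theta$, offered as an analogy; the paper's own derivation works with a residual displacement of a flow policy and may differ in form. The symbols $\delta$, $g$, and $\epsilon$ are notation introduced here.

```latex
% Second-order expansion of the KL divergence around \theta
\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\theta+\delta}\big)
  = \tfrac{1}{2}\,\delta^\top F(\theta)\,\delta + O(\|\delta\|^3),
\qquad
F(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a)\,
  \nabla_\theta \log \pi_\theta(a)^\top \right].

% Linearized improvement under the quadratic (anisotropic) constraint
\max_{\delta}\; g^\top \delta
\quad \text{s.t.} \quad \tfrac{1}{2}\,\delta^\top F(\theta)\,\delta \le \epsilon
\;\;\Longrightarrow\;\;
\delta^\star = \sqrt{\frac{2\epsilon}{g^\top F(\theta)^{-1} g}}\; F(\theta)^{-1} g .
```

Replacing $F(\theta)$ with the identity recovers an isotropic $L_2$ trust region, which treats all directions alike regardless of how concentrated the behavior data is along them; that is the mismatch the abstract attributes to prior flow policies.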

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

修复熵坍缩（HEAL）：通过混合域熵动态对齐增强少样本RLVR中的探索

Authors: Zhanyu Liu, Qingguo Hu, Ante Wang, Chenqing Liu, Zhishang Xiang, Hui Li, Delai Qiu, Jinsong Su
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17928
Pdf link: https://arxiv.org/pdf/2604.17928
Abstract Reinforcement Learning with Verifiable Reward (RLVR) has proven effective for training reasoning-oriented large language models, but existing methods largely assume high-resource settings with abundant training data. In low-resource scenarios, RLVR is prone to more severe entropy collapse, which substantially limits exploration and degrades reasoning performance. To address this issue, we propose Hybrid-domain Entropy dynamics ALignment (HEAL), a framework tailored for few-shot RLVR. HEAL first selectively incorporates high-value general-domain data to promote more diverse exploration. Then, we introduce Entropy Dynamics Alignment (EDA), a reward mechanism that aligns trajectory-level entropy dynamics between the target and general domains, capturing both entropy magnitude and fine-grained variation. Through this alignment, EDA not only further mitigates entropy collapse but also encourages the policy to acquire more diverse exploration behaviors from the general domain. Experiments across multiple domains show that HEAL consistently improves few-shot RLVR performance. Notably, using only 32 target-domain samples, HEAL matches or even surpasses full-shot RLVR trained with 1K target-domain samples.
中文摘要 带可验证奖励的强化学习（RLVR）已被证明能有效训练面向推理的大型语言模型，但现有方法大多假设训练数据充足的高资源设定。在低资源场景下，RLVR更容易出现更严重的熵坍缩，这会大幅限制探索并降低推理性能。为解决这一问题，我们提出了混合域熵动态对齐（HEAL），一个专为少样本RLVR设计的框架。HEAL首先有选择地引入高价值的通用域数据，以促进更多样化的探索。随后，我们引入熵动态对齐（EDA），这是一种奖励机制，用于对齐目标域与通用域之间的轨迹级熵动态，同时刻画熵的大小和细粒度变化。通过这种对齐，EDA不仅进一步缓解了熵坍缩，还促使策略从通用域中习得更多样化的探索行为。多个领域的实验表明，HEAL能够一致地提升少样本RLVR的性能。值得注意的是，仅使用32个目标域样本，HEAL就能达到甚至超过用1K个目标域样本训练的全量RLVR。

LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent

LiteResearcher：面向深度研究智能体的可扩展智能体强化学习训练框架

Authors: Wanli Li, Bince Qu, Bo Pan, Jianyu Zhang, Zheng Liu, Pan Zhang, Wei Chen, Bo Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.17931
Pdf link: https://arxiv.org/pdf/2604.17931
Abstract Reinforcement Learning (RL) has emerged as a powerful training paradigm for LLM-based agents. However, scaling agentic RL for deep research remains constrained by two coupled challenges: hand-crafted synthetic data fails to elicit genuine real-world search capabilities, and real-world search dependency during RL training introduces instability and prohibitive cost, which limits the scalability of Agentic RL. LiteResearcher is a training framework that makes Agentic RL scalable: by constructing a lite virtual world that mirrors real-world search dynamics, we enable a continuously improving training recipe that empowers a tiny search agent to outperform large-scale open-source and commercial models (e.g., Tongyi DeepResearch and Claude-4.5 Sonnet). Specifically, on common benchmarks such as GAIA and Xbench, our LiteResearcher-4B achieves open-source state-of-the-art results of 71.3% and 78.0% respectively, demonstrating that scalable RL training is a key enabler for Deep Research Agents.
中文摘要 强化学习（RL）已成为基于LLM的智能体的强大训练范式。然而，面向深度研究的智能体强化学习（agentic RL）的规模化仍受两个相互耦合的挑战制约：手工构造的合成数据无法激发真正的现实世界搜索能力，而在强化学习训练中依赖真实搜索又会带来不稳定性和过高的成本，这限制了智能体强化学习的可扩展性。LiteResearcher是一个让智能体强化学习具备可扩展性的训练框架：通过构建一个反映真实搜索动态的轻量虚拟世界，我们实现了可持续改进的训练方案，使一个小型搜索智能体能够超越大规模的开源和商业模型（如 Tongyi DeepResearch 和 Claude-4.5 Sonnet）。具体来说，在GAIA和Xbench等常用基准上，我们的LiteResearcher-4B分别取得了71.3%和78.0%的开源最先进成绩，表明可扩展的强化学习训练是深度研究智能体的关键推动因素。

Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations

面向情感支持对话的单轮多支持策略建模

Authors: Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Jinsong Su, Chi Zhang, Fang Kong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.17972
Pdf link: https://arxiv.org/pdf/2604.17972
Abstract Emotional Support Conversation (ESC) aims to assist individuals experiencing distress by generating empathetic and supportive dialogue. While prior work typically assumes that each supporter turn corresponds to a single strategy, real-world supportive communication often involves multiple strategies within a single utterance. In this paper, we revisit the ESC task by formulating it as multi-strategy utterance generation, where each utterance may contain one or more strategy-response pairs. We propose two generation methods: All-in-One, which predicts all strategy-response pairs in a single decoding step, and One-by-One, which iteratively generates strategy-response pairs until completion. Both methods are further enhanced with cognitive reasoning guided by reinforcement learning to improve strategy selection and response composition. We evaluate our models on the ESConv dataset under both utterance-level and dialogue-level settings. Experimental results show that our methods effectively model multi-strategy utterances and lead to improved supportive quality and dialogue success. To our knowledge, this work provides the first systematic empirical evidence that allowing multiple support strategies within a single utterance is both feasible and beneficial for emotional support conversations. All code and data will be publicly available at this https URL.
中文摘要 情感支持对话（ESC）旨在通过生成富有同理心和支持性的对话，帮助正在经历痛苦的个体。以往的研究通常假设支持者的每一轮发言对应单一策略，但现实中的支持性沟通往往在一句话语中包含多种策略。本文重新审视ESC任务，将其表述为多策略话语生成，其中每句话语可以包含一个或多个策略-回复对。我们提出了两种生成方法：All-in-One，在单个解码步骤中预测所有策略-回复对；以及One-by-One，迭代生成策略-回复对直至完成。两种方法都通过强化学习引导的认知推理进一步增强，以改进策略选择和回复组织。我们在ESConv数据集上，在话语级和对话级两种设定下评估模型。实验结果表明，我们的方法能有效建模多策略话语，并提升支持质量和对话成功率。据我们所知，这项工作首次提供了系统的实证证据，表明在单句话语中允许多种支持策略对情感支持对话既可行又有益。所有代码和数据都将在此 https URL 公开。

Neural Garbage Collection: Learning to Forget while Learning to Reason

神经垃圾回收：在学习推理的同时学会遗忘

Authors: Michael Y. Li, Jubayer Ibn Hamid, Emily B. Fox, Noah D. Goodman
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.18002
Pdf link: https://arxiv.org/pdf/2604.18002
Abstract Chain-of-thought reasoning has driven striking advances in language model capability, yet every reasoning step grows the KV cache, creating a bottleneck to scaling this paradigm further. Current approaches manage these constraints on the model's behalf using hand-designed criteria. A more scalable approach would let end-to-end learning subsume this design choice entirely, following a broader pattern in deep learning. After all, if a model can learn to reason, why can't it learn to forget? We introduce Neural Garbage Collection (NGC), in which a language model learns to forget while learning to reason, trained end-to-end from outcome-based task reward alone. As the model reasons, it periodically pauses, decides which KV cache entries to evict, and continues to reason conditioned on the remaining cache. By treating tokens in a chain-of-thought and cache-eviction decisions as discrete actions sampled from the language model, we can use reinforcement learning to jointly optimize how the model reasons and how it manages its own memory: what the model evicts shapes what it remembers, what it remembers shapes its reasoning, and the correctness of that reasoning determines its reward. Crucially, the model learns this behavior entirely from a single learning signal - the outcome-based task reward - without supervised fine-tuning or proxy objectives. On Countdown, AMC, and AIME tasks, NGC maintains strong accuracy relative to the full-cache upper bound at 2-3x peak KV cache size compression and substantially outperforms eviction baselines. Our results are a first step towards a broader vision where end-to-end optimization drives both capability and efficiency in language models.
中文摘要 思维链推理推动了语言模型能力的显著进步，但每一步推理都会增大KV缓存，成为进一步扩展这一范式的瓶颈。当前的方法使用手工设计的准则替模型管理这些约束。更具可扩展性的做法是让端到端学习完全取代这一设计选择，这也符合深度学习中更普遍的规律。毕竟，如果模型能学会推理，为什么不能学会遗忘呢？我们提出了神经垃圾回收（NGC），让语言模型在学习推理的同时学会遗忘，并且仅凭基于结果的任务奖励进行端到端训练。模型在推理过程中会周期性地暂停，决定驱逐哪些KV缓存条目，然后基于剩余缓存继续推理。通过将思维链中的词元和缓存驱逐决策都视为从语言模型中采样的离散动作，我们可以用强化学习联合优化模型如何推理以及如何管理自身记忆：模型驱逐什么决定了它记住什么，它记住什么影响它的推理，而推理的正确性决定了它的奖励。关键在于，模型完全从单一学习信号（基于结果的任务奖励）中学会这种行为，无需监督微调或代理目标。在Countdown、AMC和AIME任务上，NGC在KV缓存峰值大小压缩2-3倍的情况下，相对于全缓存上界仍保持较高的准确率，并大幅优于各驱逐基线。我们的结果是迈向更广阔愿景的第一步，即由端到端优化同时驱动语言模型的能力和效率。

SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

SELF-EMO：从识别到一致表达的情感自我演化

Authors: Shaowei Zhang, Faqiang Qian, Yan Chen, Ziliang Wang, Kang An, Yong Dai, Mengya Gao, Yichao Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18003
Pdf link: https://arxiv.org/pdf/2604.18003
Abstract Emotion Recognition in Conversation (ERC) has become a fundamental capability for large language models (LLMs) in human-centric interaction. Beyond accurate recognition, coherent emotional expression is also crucial, yet both are limited by the scarcity and static nature of high-quality annotated data. In this work, we propose SELF-EMO, a self-evolution framework grounded in the hypothesis that better emotion prediction leads to more consistent emotional responses. We introduce two auxiliary tasks, emotional understanding and emotional expression, and design a role-based self-play paradigm where the model acts as both an emotion recognizer and a dialogue responder. Through iterative interactions, the model generates diverse conversational trajectories, enabling scalable data generation. To ensure quality, we adopt a data flywheel mechanism that filters candidate predictions and responses using a smoothed IoU-based reward and feeds selected samples back for continuous self-improvement without external supervision. We further develop SELF-GRPO, a reinforcement learning algorithm that stabilizes optimization with multi-label alignment rewards and group-level consistency signals. Experiments on IEMOCAP, MELD, and EmoryNLP show that SELF-EMO achieves state-of-the-art performance, improving accuracy by +6.33% on Qwen3-4B and +8.54% on Qwen3-8B, demonstrating strong effectiveness and generalization.
中文摘要 对话情感识别（ERC）已成为大型语言模型（LLM）在以人为中心的交互中的一项基础能力。除了准确识别之外，连贯的情感表达同样至关重要，但两者都受限于高质量标注数据的稀缺性和静态性。在本研究中，我们提出了SELF-EMO，一个自我演化框架，其基本假设是：更好的情感预测会带来更一致的情感回应。我们引入了情感理解和情感表达两个辅助任务，并设计了一种基于角色的自我博弈范式，让模型同时充当情感识别者和对话回应者。通过迭代交互，模型生成多样化的对话轨迹，从而实现可扩展的数据生成。为确保质量，我们采用数据飞轮机制，利用基于IoU的平滑奖励过滤候选预测和回复，并将选中的样本回灌，在无外部监督的情况下实现持续自我改进。我们进一步提出了SELF-GRPO，这是一种强化学习算法，通过多标签对齐奖励和组级一致性信号来稳定优化。在IEMOCAP、MELD和EmoryNLP上的实验显示，SELF-EMO达到了最先进的性能，在Qwen3-4B上准确率提升+6.33%，在Qwen3-8B上提升+8.54%，展现出很强的有效性和泛化能力。
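
The abstract mentions filtering candidates with a "smoothed IoU-based reward" over emotion labels. One plausible reading is an intersection-over-union between a predicted label set and a reference label set with additive smoothing, sketched below. The smoothing form, the `eps` value, the threshold, and where the reference labels come from are all assumptions, since the abstract gives none of these details.

```python
def smoothed_iou(predicted, reference, eps=1.0):
    """IoU between two label sets with additive smoothing.

    With eps > 0 the score is defined for two empty sets and stays
    strictly positive for disjoint sets, which softens the reward.
    """
    pred_set, ref_set = set(predicted), set(reference)
    intersection = len(pred_set & ref_set)
    union = len(pred_set | ref_set)
    return (intersection + eps) / (union + eps)


def keep_sample(predicted, reference, threshold=0.6, eps=1.0):
    """Hypothetical flywheel filter: keep candidates scoring high enough."""
    return smoothed_iou(predicted, reference, eps) >= threshold


if __name__ == "__main__":
    candidates = [
        (["joy"], ["joy"]),
        (["joy", "surprise"], ["joy"]),
        (["anger"], ["joy"]),
    ]
    for predicted, reference in candidates:
        print(predicted, reference,
              smoothed_iou(predicted, reference),
              keep_sample(predicted, reference))
```

A set-based score like this gives partial credit when a multi-label prediction overlaps the reference, which a plain exact-match reward would not.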

CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

CodePivot：通过强化学习在LLM中自举多语言转译能力，无需平行语料

Authors: Shangyu Li, Juyong Jiang, Meibo Ren, Sizhe Zhong, Huiri Tan, Yunhao Gou, Xu Han, Chun Yong Chong, Yun Peng, Jiasi Shen
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.18027
Pdf link: https://arxiv.org/pdf/2604.18027
Abstract Transpilation, or code translation, aims to convert source code from one programming language (PL) to another. It is beneficial for many downstream applications, from modernizing large legacy codebases to augmenting data for low-resource PLs. Recent large language model (LLM)-based approaches have demonstrated immense potential for code translation. Among these approaches, training-based methods are particularly important because LLMs currently do not effectively adapt to domain-specific settings that suffer from a lack of knowledge without targeted training. This limitation is evident in transpilation tasks involving low-resource PLs. However, existing training-based approaches rely on a pairwise transpilation paradigm, making it impractical to support a diverse range of PLs. This limitation is particularly prominent for low-resource PLs due to a scarcity of training data. Furthermore, these methods suffer from suboptimal reinforcement learning (RL) reward formulations. To address these limitations, we propose CodePivot, a training framework that leverages Python as an intermediate representation (IR), augmented by a novel RL reward mechanism, Aggressive-Partial-Functional reward, to bootstrap the model's multilingual transpilation ability without requiring parallel corpora. Experiments involving 10 PLs show that the resulting 7B model, trained on Python-to-Others tasks, consistently improves performance across both general and low-resource PL-related transpilation tasks. It outperforms substantially larger mainstream models with hundreds of billions more parameters, such as Deepseek-R1 and Qwen3-235B-A22B-Instruct-2507, on Python-to-Others tasks and Others-to-All tasks, respectively. In addition, it outperforms its counterpart trained directly on Any-to-Any tasks on general transpilation tasks. The code and data are available at this https URL.
中文摘要 转译（或称代码翻译）旨在将源代码从一种编程语言（PL）转换为另一种。它对许多下游应用都有益，从大型遗留代码库的现代化到为低资源PL扩充数据。近期基于大型语言模型（LLM）的方法在代码翻译方面展现出巨大潜力。其中，基于训练的方法尤为重要，因为在缺乏针对性训练的情况下，LLM目前还无法有效适应知识匮乏的特定领域设定。这一局限在涉及低资源PL的转译任务中十分明显。然而，现有基于训练的方法依赖成对转译范式，难以支持种类繁多的PL。由于训练数据稀缺，这一限制对低资源PL尤为突出。此外，这些方法的强化学习（RL）奖励设计也不够理想。为解决这些问题，我们提出了CodePivot，一个以Python作为中间表示（IR）的训练框架，并辅以一种新的强化学习奖励机制Aggressive-Partial-Functional奖励，在无需平行语料的情况下自举模型的多语言转译能力。涉及10种PL的实验表明，在Python-to-Others任务上训练得到的7B模型，在通用和低资源PL相关的转译任务上都带来了一致的性能提升。它在Python-to-Others任务和Others-to-All任务上分别优于Deepseek-R1和Qwen3-235B-A22B-Instruct-2507等参数量多出数千亿、规模大得多的主流模型。此外，在通用转译任务上，它也优于直接在Any-to-Any任务上训练的对应模型。代码和数据可在此 https URL 获取。

ConventionPlay: Capability-Limited Training for Robust Ad-Hoc Collaboration

ConventionPlay：能力有限的培训，促进强有力的临时协作

Authors: Abhishek Sriraman, Eleni Vasilaki, Robert Loftin
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.18123
Pdf link: https://arxiv.org/pdf/2604.18123
Abstract Ad-hoc collaboration often relies on identifying and adhering to shared conventions. However, when partners can follow multiple conventions, agents must do more than simply adapt; they must actively steer the team toward the most effective joint strategy. We present ConventionPlay, a reinforcement learning-based approach that extends cognitive hierarchies to include a diverse population of adaptive followers. By training against partners with varied capability limits, our agent learns to probe its partner's repertoire, leading the team when possible and following when necessary. Our results in canonical coordination tasks show that ConventionPlay achieves superior coordination efficiency, particularly in settings where conventions have differentiated payoffs.
中文摘要 临时协作通常依赖于识别并遵守共享的惯例。然而，当合作伙伴可以遵循多种惯例时，代理必须做的不仅仅是适应;他们必须积极引导团队走向最有效的联合战略。我们介绍了ConventionPlay，一种基于强化学习的方法，将认知层级扩展到涵盖多样化的适应性跟随者群体。通过与能力极限不同的搭档训练，我们的探员学会了探查搭档的技能库，在可能时带领团队，必要时跟随。我们在规范协调任务中的结果表明，ConventionPlay在约定收益差异化的环境中，实现了更优越的协调效率。

Frugal Geofencing via Energy-aware Sensing and Reporting

通过能源感知和报告实现节约地理围栏

Authors: David E. Ruiz-Guirola, Miltiadis Filippou, Onel A. Lopez
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.18141
Pdf link: https://arxiv.org/pdf/2604.18141
Abstract Timely and accurate monitoring in geofencing scenarios is challenging when relying on ultra-low power Internet of Things devices (IoTDs) powered by energy harvesting (EH). This is mainly because frequent wake-ups for data acquisition and data uploading may quickly deplete their limited energy buffer. Conventional grid-like IoT deployments overlook these limitations and merely rely on continuously powered sensing. Herein, we propose an energy-aware geofencing framework for camera-equipped EH IoTDs deployed around a protected area and its surrounding perimeter zone. The framework integrates a directional sensing power model with an operational representation of EH, sensing, sleeping, and reporting, accounting for the limited field-of-view (FoV) and distance-dependent detection confidence of the IoTDs. Device activity is controlled by the coverage-providing access point, which hosts a mobile edge host and a facility geofencing system to ensure timely and reliable detection under tight energy constraints. Reinforcement learning is used to determine IoTD placement, enabling earlier intruder detection than uniform grid-based deployments. Numerical results show that the proposed coordinated sensing and reporting configuration achieves frugal geofencing with fewer devices, while concurrently improving detection timeliness and dependability.
中文摘要 在地理围栏场景中，依赖由能量收集（EH）驱动的超低功耗物联网设备（IoTDs）时，及时且准确的监测充满挑战。这主要是因为频繁的唤醒以进行数据采集和上传，可能会迅速耗尽其有限的能量缓冲。传统的类网格物联网部署忽略了这些限制，仅依靠持续供电的传感。本文提出一种能源感知的地理围栏框架，适用于部署在保护区及其周边周边区域的摄像头EH IoTDs。该框架将方向感测功率模型与EH、感测、睡眠和报告的操作表示相结合，考虑了IoTDs有限的视场（FoV）和距离依赖的探测置信度。设备活动由覆盖覆盖接入点控制，接入点托管移动边缘主机和设施地理测定系统，确保在严格的能源限制下及时且可靠地检测。强化学习用于确定IoTD的定位，使入侵者比基于网格的均匀部署更早被发现。数值结果表明，所提的协调感测与报告配置以更少设备实现节约地理围栏，同时提升检测时效性和可靠性。

Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

“可微模拟器是否能提供更好的策略梯度？”能提供更好的策略梯度吗？

Authors: Ku Onoda, Paavo Parmas, Manato Yaguchi, Yutaka Matsuo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.18161
Pdf link: https://arxiv.org/pdf/2604.18161
Abstract In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.
中文摘要 在策略梯度强化学习中，使用可微模型实现一阶梯度估计，加快学习速度，相较于仅依赖无导数的零阶估计。然而，不连续动力学会导致偏差，削弱一阶估计器的有效性。此前的研究通过围绕REINFORCE零阶梯度估计器构建置信区间，并利用这些界限检测不连续点来解决这一偏倚。然而，REINFORCE估计器噪声较大，我们发现该方法需要针对特定任务的超参数调优，且采样效率较低。本文探讨了这种偏见是否是主要障碍，以及哪些最小的修正措施足够。首先，我们重新审视之前工作的标准不连续设置，并引入DDCG，这是一种轻量级测试，在非平滑区域切换估计量;采用单一超参数时，DDCG实现了稳健的性能，并且在小样本下依然可靠。其次，在可微机器人控制任务中，我们介绍了IVW-H，一种每步逆方差实现，能够稳定方差且无需显式不连续检测，并取得强有力的结果。这些发现表明，虽然估计器切换在受控研究中提高了鲁棒性，但在实际部署中，谨慎的方差控制往往占主导地位。
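
The abstract does not spell out how IVW-H works, so the following is only a generic illustration of inverse-variance weighting, the textbook idea its name points to. It is a minimal sketch that assumes scalar gradients and two independent, unbiased estimators: each estimator's mean is weighted by the reciprocal of the estimated variance of that mean. The function name and the sample values are made up for illustration.

```python
from statistics import mean, variance


def inverse_variance_combine(samples_a, samples_b, eps=1e-12):
    """Combine two sets of scalar gradient samples.

    Each estimator's sample mean is weighted by 1 / (variance of that mean),
    so the noisier estimator contributes less. This is generic
    inverse-variance weighting, not the paper's IVW-H procedure.
    """
    var_a = variance(samples_a) / len(samples_a) + eps
    var_b = variance(samples_b) / len(samples_b) + eps
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    return (w_a * mean(samples_a) + w_b * mean(samples_b)) / (w_a + w_b)


# Made-up numbers: a noisy 0th-order estimator and a tight 1st-order one.
zeroth_order = [0.5, 2.5, -0.5, 3.5]
first_order = [1.0, 1.1, 0.9, 1.0]
combined = inverse_variance_combine(zeroth_order, first_order)
print(combined)
```

With these values the combined estimate lands very close to the low-variance estimator's mean, which is the behaviour the weighting is meant to produce.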

QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

QuantumQA：通过物理一致性数据集和验证感知强化学习提升科学推理能力

Authors: Songxin Qu, Tai-Ping Sun, Yun-Jie Wang, Huan-Yu Liu, Cheng Xue, Xiao-Fan Xu, Han Fang, Yang Yang, Yu-Chun Wu, Guo-Ping Guo, Zhao-Yun Chen
Subjects: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2604.18176
Pdf link: https://arxiv.org/pdf/2604.18176
Abstract Large language models (LLMs) show strong capabilities in general reasoning but typically lack reliability in scientific domains like quantum mechanics, which demand strict adherence to physical constraints. This limitation arises from the scarcity of verifiable training resources and the inadequacy of coarse feedback signals in standard alignment paradigms. To address the data challenge, we introduce QuantumQA, a large-scale dataset constructed via a task-adaptive strategy and a hybrid verification protocol that combines deterministic solvers with semantic auditing to guarantee scientific rigor. Building on this foundation, we propose the verification-aware reward model (VRM) tailored for Reinforcement Learning with Verifiable Rewards (RLVR), which employs an adaptive reward fusion (ARF) mechanism to dynamically integrate deterministic signals from a scientific execution suite (SES) with multidimensional semantic evaluations for precise supervision. Experimental results demonstrate that our method consistently outperforms baselines and general-purpose preference models. Notably, our optimized 8B model achieves performance competitive with proprietary models, validating that incorporating verifiable, rule-based feedback into the reinforcement learning loop offers a parameter-efficient alternative to pure scaling.
中文摘要 大型语言模型（LLMs）在一般推理方面表现出强大的能力，但在量子力学等科学领域通常缺乏可靠性，因为这些领域要求严格遵守物理约束。这一局限源于可验证的训练资源稀缺以及标准比对范式中粗反馈信号的不足。为应对数据挑战，我们引入了QuantumQA，这是一个通过任务自适应策略和混合验证协议构建的大规模数据集，结合了确定性求解器和语义审计，以确保科学严谨性。在此基础上，我们提出了针对可验证奖励强化学习（RLVR）量身定制的验证感知奖励模型（VRM），采用自适应奖励融合（ARF）机制，将科学执行套件（SES）中的确定性信号与多维语义评估动态整合，实现精确监督。实验结果表明，我们的方法始终优于基线和通用偏好模型。值得注意的是，我们优化的8B模型性能可与专有模型媲美，验证了将可验证的基于规则的反馈纳入强化学习循环，提供了参数效率高的替代方案，取代纯粹的扩展。

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

音频深度思考者：渐进式推理感知强化学习，促进音频语言模型中高质量的思维链涌现

Authors: Xiang He, Chenxing Li, Jinting Wang, Yan Rong, Tianxin Xie, Wenfu Wang, Li Liu, Dong Yu
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.18187
Pdf link: https://arxiv.org/pdf/2604.18187
Abstract Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by training data quality, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains often appear well-structured yet lack specific acoustic grounding. We propose Audio-DeepThinker, a framework built on two core ideas. First, we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. Second, we propose a progressive two-stage curriculum that enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for greater reasoning diversity. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), winning 1st Place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses further reveal that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, offering mechanistic insights into how audio reasoning emerges through exploration.
中文摘要 大型音频语言模型（LALMs）在音频理解方面取得了显著进展，但它们主要作为感知与回答系统运作，缺乏明确的推理过程。现有增强音频推理的方法要么依赖受限于训练数据质量的监督思维链（CoT）微调，要么依赖带有粗奖励但不直接评估推理质量的强化学习（RL）。因此，生成的推理链往往结构良好，但缺乏特定的声学基础。我们提出了Audio-DeepThinker，这是一个基于两个核心理念的框架。首先，我们引入了一种混合推理相似性奖励，通过结合LLM评估器评估逻辑路径对齐、关键步骤覆盖率和分析深度，与嵌入相似性组件强制执行与引用推理链语义对齐，直接监督推理链的质量。其次，我们提出一个渐进式的两阶段课程，使高质量的CoT推理能够通过纯强化学习探索而产生，无需任何监督推理的微调，从一个没有先前思维链能力的指令调优模型中诞生。第一阶段训练基础音频质检，采用混合奖励以培养基本推理模式，第二阶段则转向声学挑战性的边界案例，仅限LLM奖励，以提升推理多样性。Audio-DeepThinker 在 MMAR（74.0%）、MMAU-test-mini（78.5%）和 MMSU（77.26%）中取得了最先进的成绩，荣获 Interspeech 2026 音频推理挑战（单模特赛道）第一名。可解释性分析进一步揭示，强化学习主要重塑上层的 MoE 门控机制，推理令牌在上层变换器层逐渐结晶，为音频推理如何通过探索产生提供了机制性见解。

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

AJ-Bench：面向环境感知评估的 Agent-as-a-Judge 基准测试

Authors: Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen, Fuli Feng, Xueyuan Hao, Xi Su, Qi Gu, Hui Su, Xunliang Cai, Xiangnan He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18240
Pdf link: https://arxiv.org/pdf/2604.18240
Abstract As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark that systematically evaluates Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at this https URL.
中文摘要 随着强化学习不断扩大大型基于语言模型的智能体训练，在复杂环境中可靠验证智能体行为变得越来越具有挑战性。现有方法依赖基于规则的验证器或以LLM为法官的模型，这些模型难以超越狭窄领域进行推广。代理即法官通过积极与环境和工具互动以获取可验证的证据来解决这一局限，但其能力仍未被充分探索。我们引入了一个基准AJ-Bench，系统评估代理作为评判者在搜索、数据系统和图形用户界面三个领域，涵盖155个任务和516条注释轨迹。该基准全面评估法官代理人在信息获取、状态验证和流程验证方面的能力。实验显示，LLM作为评判基线在性能上持续提升，同时也揭示了基于主体验证的重大挑战。我们的数据和代码可在该 https URL 访问。

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

代理世界：扩展真实世界环境综合以演进通用智能体智能

Authors: Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu, Shijue Huang, Zhenyu Li, Yang Zhao, Xiaoshuai Song, Xiaoxi Li, Jiajie Jin, Yutao Zhu, Hanbin Wang, Fangyu Lei, Qinyu Luo, Mingyang Chen, Zehui Chen, Jiazhan Feng, Ji-Rong Wen, Zhicheng Dou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.18292
Pdf link: https://arxiv.org/pdf/2604.18292
Abstract Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present Agent-World, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.
中文摘要 大型语言模型越来越被期望作为通用代理，与外部有状态的工具环境交互。模型上下文协议（MCP）及更广泛的代理技能为连接代理与可扩展的现实世界服务提供了统一接口，但由于缺乏现实环境和终身学习的原则机制，强健的代理训练仍受限。本文介绍了 Agent-World，这是一个自我演进的培训场域，旨在通过可扩展环境推进通用智能体智能。Agent-World 有两个主要组成部分：（1）代理环境任务发现，自主探索数千个真实环境主题的主题对齐数据库和可执行工具生态系统，并以可控难度综合可验证任务；以及（2）持续自我演化的代理训练，结合多环境强化学习与自演进代理领域，通过动态任务综合自动识别能力缺口并驱动有针对性学习，实现代理策略和环境的共演进。在23项具有挑战性的代理基准测试中，Agent-World-8B 和 14B 始终优于强大的专有模型和环境扩展基线。进一步分析揭示了环境多样性和自我进化轮次相关的尺度趋势，为构建通用智能体智能提供了见解。

Training and Agentic Inference Strategies for LLM-based Manim Animation Generation

基于LLM的Manim动画生成的训练与代理推理策略

Authors: Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.18364
Pdf link: https://arxiv.org/pdf/2604.18364
Abstract Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models' responsiveness to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.
中文摘要 使用Manim等库生成程序动画对大型语言模型（LLMs）来说是独特的挑战，需要空间推理、时间序列以及对领域特定API的熟悉，而这些API在一般预训练数据中往往被低估。当前研究缺乏对训练与推理策略在该环境中相互作用的系统性研究。本研究介绍了ManimTrainer，这是一条结合监督微调（SFT）和基于强化学习（RL）的群体相对策略优化（GRPO）的培训流程，采用统一的奖励信号融合代码与可视化评估信号;以及ManimAgent，一个推理流水线，采用Renderer-in-the-loop（RITL）和API文档增强的RITL（RITL-DOC）策略。利用这些技术，本研究首次采用Manim实现文本到代码再视频转换的统一训练与推断研究。它利用 ManimBench 评估了 17 个开源的 sub-30B LLM，采用九种训练和推理策略组合。结果显示，SFT通常能提升代码质量，而GRPO则增强视觉输出，并提高模型在推断时自我纠正时对外部信号的响应能力。Qwen 3 Coder 30B 模型配合 GRPO 和 RITL-DOC 实现了最高的整体性能，在参考视频中呈现出 94% 的渲染成功率（RSR）和 85.7% 的视觉相似度（VS），在 VS 中比基础 GPT-4.1 模型高出 +3 个百分点。此外，分析显示，代码与视觉指标的相关性在SFT和GRPO中增强，但在推理时间增强时减弱，凸显了训练与代理推理策略在Manim动画生成中的互补作用。

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

从更少中学习：衡量RLVR在低数据和计算体系中的有效性

Authors: Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.18381
Pdf link: https://arxiv.org/pdf/2604.18381
Abstract Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
中文摘要 微调大型语言模型（LLM）通常依赖大量高质量的注释数据，或者在可验证奖励强化学习（RLVR）中具有明确的真实答案的问题。虽然此前的研究探讨了通过同时扩展RLVR数据和计算对模型推理能力的益处，但这些结果在许多注释数据和可访问计算资源稀缺的现实环境中并不适用。本研究提出了一项关于RLVR后开源小语言模型（SLM）在低数据环境中表现的综合实证研究。通过三个涵盖数字计数问题、图推理和空间推理的新数据集，我们描述了模型性能如何随数据集大小、多样性和复杂性而扩展。我们证明：（1）过程式数据集允许细粒度的评估和训练数据集开发，具有可控属性（大小、多样性和复杂度），（2）在RLVR下，训练于低复杂度任务的模型可以推广到高复杂度任务，（3）混合复杂度数据集训练在低数据环境中带来最大优势，样本效率高达5倍，而训练简单任务时则高达5倍。这些发现激励了未来在RLVR数据尺度定律开发及使用程序式数据生成器以进一步理解有效数据开发以实现高效LLM微调的研究。

OpenGame: Open Agentic Coding for Games

OpenGame：游戏中的开放代理编码

Authors: Yilei Jiang, Jinyuan Hu, Qianyin Xiao, Yaozhi Zheng, Ruize Ma, Kaituo Feng, Jiaming Han, Tianshuo Peng, Kaixuan Fan, Manyuan Zhang, Xiangyu Yue
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.18394
Pdf link: https://arxiv.org/pdf/2604.18394
Abstract Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence. We bridge this gap with OpenGame, the first open-source agentic framework explicitly designed for end-to-end web game creation. At its core lies Game Skill, a reusable, evolving capability composed of a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes. Together, these enable the agent to scaffold stable architectures and systematically repair integration errors rather than patch isolated syntax bugs. Powering this framework is GameCoder-27B, a code LLM specialized for game engine mastery through a three-stage pipeline of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. Since verifying interactive playability is fundamentally harder than checking static code, we further introduce OpenGame-Bench, an evaluation pipeline that scores agentic game generation along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging. Across 150 diverse game prompts, OpenGame establishes a new state-of-the-art. We hope OpenGame pushes code agents beyond discrete software engineering problems and toward building complex, interactive real-world applications. Our framework will be fully open-sourced.
中文摘要 游戏开发处于创意设计与复杂软件工程的交汇点，要求游戏引擎、实时循环以及多个文件间紧密耦合状态的联合编排。虽然大型语言模型（LLM）和代码代理现在可以轻松解决孤立的编程任务，但在要求从高层设计中生成完全可玩的游戏时，它们经常因跨文件不一致、场景布线混乱和逻辑不连贯而崩溃。我们通过OpenGame弥合这一鸿沟，OpenGame是首个专门为端到端网页游戏制作设计的开源代理框架。其核心是游戏技能，这是一项可重用、不断演进的能力，由模板技能（根据经验构建项目骨架库）和调试技能（维护经过验证修复的活协议）组成——共同使代理能够搭建稳定架构并系统性修复集成错误，而非仅仅修补孤立的语法漏洞。支撑该框架的是GameCoder-27B，一款专注于通过三阶段流程——持续预训练、监督微调和执行基础强化学习——进行游戏引擎掌握的代码LLM。由于验证交互可玩性本质上比检查静态代码更难，我们进一步引入了OpenGame-Bench评估流程，通过无头浏览器执行和VLM判断，对代理游戏生成的构建健康度、视觉可用性和意图对齐进行评分。通过150个多样化的游戏提示，OpenGame建立了全新的技术水平。我们希望OpenGame能推动代码代理超越离散的软件工程问题，构建复杂且互动的现实应用。我们的框架将完全开源。

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

StepPO：代理强化学习的步进对齐策略优化

Authors: Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.18401
Pdf link: https://arxiv.org/pdf/2604.18401
Abstract General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice, and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
中文摘要 通用代理催生了诸如OpenClaw和Claude Code等非凡应用。随着这些代理系统（又称光束）追求更大胆的目标，它们对基础大型语言模型（LLM）要求越来越强大的代理能力。代理强化学习（RL）正逐渐成为赋能LLM这些能力的核心后培训范式，并在代理训练中扮演着越来越关键的角色。与单回合的令牌级对齐或推理增强不同，RLHF和RLVR中，智能强化学习面向多回合交互环境，目标是优化核心智能体能力，如决策和工具使用，同时应对包括延迟和稀疏奖励，以及长且可变上下文等新挑战。因此，传统LLM RL继承的以代币为中心的建模和优化范式，正变得越来越难以捕捉真实LLM代理的行为。本文将StepPO介绍为阶级能动强化学习的一种位置。我们主张，传统的令牌级马尔可夫决策过程（MDP）应被推进到步级MDP表述，并且步而非令牌应被视为LLM代理的适当动作表示。随后，我们提出阶级信用分配作为该表述的自然优化对应，从而使策略优化和奖励传播与代理决策的粒度保持一致。最后，我们讨论了实现步级能动强化学习所需的关键系统设计，初步实验为这一观点的有效性提供了初步证据。我们希望StepPO所体现的步进对齐、步级范式，能为代理强化学习社区提供理解代理行为的有用视角，并助力LLM向更强的通用代理能力迈进。
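
The abstract argues for treating the step, rather than the token, as the unit of action and credit. One way to picture this, as a minimal sketch and not StepPO's actual algorithm, is to compute one advantage per agent step and share it across every token that step generated. The function name and its inputs are hypothetical.

```python
from typing import List


def broadcast_step_advantages(
    step_token_counts: List[int], step_advantages: List[float]
) -> List[float]:
    """Expand one advantage per step into one weight per generated token.

    step_token_counts[i] is the number of tokens the agent emitted in step i;
    every token in that step receives the step's advantage. This only
    illustrates step-level granularity and is an assumption, not the paper's
    method.
    """
    if len(step_token_counts) != len(step_advantages):
        raise ValueError("need exactly one advantage per step")
    per_token: List[float] = []
    for count, advantage in zip(step_token_counts, step_advantages):
        per_token.extend([advantage] * count)
    return per_token


# Two steps: a 2-token tool call and a 3-token final answer.
weights = broadcast_step_advantages([2, 3], [0.5, -1.0])
print(weights)
```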

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

知道何时该退出：LLM推理中动态戒断的原则框架

Authors: Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.18419
Pdf link: https://arxiv.org/pdf/2604.18419
Abstract Large language models (LLMs) using chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
中文摘要 使用思维链推理的大型语言模型（LLM）常常因产生冗长且错误的回答而浪费大量计算。弃权可以通过保留不太可能正确的产出来缓解这一问题。大多数弃权方法决定在生成前或生成后保留输出，而动态中代弃权则考虑在每个标记位置提前终止无望推理迹。此前已有研究探讨过这一观点的实证变体，但关于弃权规则的原则性指导仍然缺乏。我们提出了大型语言模型动态隐匿的形式分析，将隐匿建模为正则化强化学习框架下的显式动作。一个戒除奖励参数控制计算与信息之间的权衡。我们证明，当价值函数低于该奖励时，禁欲在一般条件下严格优于自然基线。我们进一步推导出一种原则性且高效的方法来近似该价值函数。数学推理和毒性避免任务的实证结果支持了我们的理论，并展示了比现有方法更高的选择准确率。
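
The decision rule stated in the abstract, abstaining when the value function falls below the abstention reward, can be sketched as a generation loop with an early exit. This is a toy illustration under stated assumptions: `step_fn` and `value_fn` are hypothetical stand-ins for a decoder and a learned value estimate, and the paper's method for approximating the value function is not reproduced here.

```python
from typing import Callable, List, Optional


def generate_with_abstention(
    step_fn: Callable[[List[str]], Optional[str]],
    value_fn: Callable[[List[str]], float],
    abstain_reward: float,
    max_tokens: int = 64,
) -> Optional[List[str]]:
    """Generate token by token; return None (abstain) as soon as the
    estimated value of the current prefix drops below abstain_reward."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        if value_fn(tokens) < abstain_reward:
            return None  # abstain: stop spending compute on this trace
        next_token = step_fn(tokens)
        if next_token is None:  # the decoder signals end of sequence
            break
        tokens.append(next_token)
    return tokens


# Toy stand-ins: a fixed 3-token answer and a value that decays with length.
ANSWER = ["a", "b", "c"]
toy_step = lambda prefix: ANSWER[len(prefix)] if len(prefix) < len(ANSWER) else None
toy_value = lambda prefix: 1.0 - 0.3 * len(prefix)

strict = generate_with_abstention(toy_step, toy_value, abstain_reward=0.5)
lenient = generate_with_abstention(toy_step, toy_value, abstain_reward=0.0)
```

A higher abstention reward makes the loop give up earlier, which mirrors the compute-versus-information trade-off the abstract attributes to this parameter.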

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

分开训练，合并使用：基于混合专家的模块化后训练

Authors: Jacob Morrison, Sanjay Adhikesaven, Akshita Bhagia, Matei Zaharia, Noah A. Smith, Sewon Min
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.18473
Pdf link: https://arxiv.org/pdf/2604.18473
Abstract Extending a fully post-trained language model with new domain capabilities is fundamentally limited by monolithic training paradigms: retraining from scratch is expensive and scales poorly, while continued training often degrades existing capabilities. We present BAR (Branch-Adapt-Route), which trains independent domain experts, each through its own mid-training, supervised finetuning, and reinforcement learning pipeline, and composes them via a Mixture-of-Experts architecture with lightweight router training. Unlike retraining approaches that mix all domains and require full reprocessing for any update (with cost scaling quadratically), BAR enables updating individual experts independently with linear cost scaling and no degradation to existing domains. At the 7B scale, with experts for math, code, tool use, and safety, BAR achieves an overall score of 49.1 (averaged across 7 evaluation categories), matching or exceeding re-training baselines (47.8 without mid-training, 50.5 with). We further show that modular training provides a structural advantage: by isolating each domain, it avoids the catastrophic forgetting that occurs when late-stage RL degrades capabilities from earlier training stages, while significantly reducing the cost and complexity of updating or adding a domain. Together, these results suggest that decoupled, expert-based training is a scalable alternative to monolithic retraining for extending language models.
中文摘要 将完全后训练的语言模型扩展到具备新领域能力，根本受限于单一的训练范式：从零开始重新训练成本高且扩展性差，而持续训练往往会削弱现有能力。我们介绍BAR（分支-适应-路径），通过各自的中期培训、监督微调和强化学习流程培训独立领域专家，并通过专家混合架构和轻量级路由器培训进行组合。与混合所有领域并要求每次更新都需完全重新处理（成本按二次方增长）的再培训方法不同，BAR允许独立更新单个专家，采用线性成本扩展，且不对现有领域造成降级。在7B量表下，数学、代码、工具使用和安全专家，BAR总分为49.1（跨7个评估类别平均），匹配或超过再培训基线（未中期培训时47.8，含中期培训50.5）。我们还进一步证明，模块化训练具有结构优势：通过隔离每个领域，避免了后期强化学习导致早期训练阶段能力下降时发生的灾难性遗忘，同时显著降低了更新或添加领域的成本和复杂性。综合来看，这些结果表明，解耦的专家式训练是扩展语言模型的可扩展替代方案，取代单一再训练。

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

XEmbodied：一个基于大型具身环境增强几何和物理线索的基础模型

Authors: Kangan Qian, ChuChu Xie, Yang Zhong, Jingrui Pang, Siwen Jiao, Sicong Jiang, Zilin Huang, Yunlong Wang, Kun Jiang, Mengmeng Yang, Hao Ye, Guanghao Zhang, Hangjun Ye, Guang Chen, Long Chen, Diange Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.18484
Pdf link: https://arxiv.org/pdf/2604.18484
Abstract Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.
中文摘要 视觉-语言-行动（VLA）模型驱动下一代自主系统，但训练它们需要来自复杂环境中的可扩展、高质量注释。当前的云流水线依赖于通用的视觉语言模型（VLM），这些模型由于二维图像文本预训练，缺乏几何推理和领域语义。为解决这一不匹配，我们提出了XEmbodied，一种云端基础模型，赋予VLM内在的三维几何感知能力，并能与物理线索（如占用网格、三维盒子）互动。XEmbodieed 不将几何视为辅助输入，而是通过结构化的 3D 适配器整合几何表示，并利用高效的图像嵌入适配器将物理信号提炼成上下文标记。通过渐进式领域课程和培训后强化学习，XEmbodied 保留了通用能力，同时在 18 个公共基准测试中展现出强劲的性能。它显著提升了大规模场景挖掘和具象VQA的空间推理、流量语义、具象可供性以及分布外泛化。

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

过于正确而难以学习：饱和推理数据上的强化学习

Authors: Zhenwen Liang, Yujun Zhou, Sidi Lu, Xiangliang Zhang, Haitao Mi, Dong Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.18493
Pdf link: https://arxiv.org/pdf/2604.18493
Abstract Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.
中文摘要 强化学习（RL）增强了大型语言模型的推理能力，但随着模型规模扩大，出现了一个悖论：强大的基础模型使标准基准测试（如MATH）趋于饱和，从而产生正确但同质的解。在这种情况下，缺乏失败案例会导致组相对算法（如GRPO）中的优势信号消失，使策略陷入模式崩溃。为此，我们提出了受限均匀Top-K采样（CUTS），这是一种无参数的解码策略，强制执行保持结构的探索。与遵循模型偏差的标准采样不同，CUTS通过从受限的高置信度候选中均匀采样，使局部优化景观趋于平坦。我们将其整合进Mixed-CUTS，这是一个训练框架，协同利用型与探索型rollout，以放大组内优势方差。在Qwen3模型上的实验表明，我们的方法能够防止策略退化，并显著提升域外泛化能力。值得注意的是，Mixed-CUTS在具有挑战性的AIME25基准测试上，Pass@1准确率比标准GRPO最多提高15.1%，验证了在语义流形内保持多样性对于严谨推理至关重要。
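
The vanishing-advantage effect described in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used GRPO-style normalization (reward minus group mean, divided by group standard deviation). The function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices and are not taken from the paper.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumed form: (r - mean) / (std + eps). `eps` is a hypothetical
    stabilizer that avoids division by zero when all rewards are equal.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# A group with both successes and failures yields a non-zero learning signal.
mixed = group_relative_advantages([1, 0, 1, 0])

# A saturated group (every rollout correct) yields all-zero advantages,
# so the policy gradient contribution from this prompt disappears.
saturated = group_relative_advantages([1, 1, 1, 1])

print("mixed:    ", mixed)
print("saturated:", saturated)
```

Under this assumption, any mechanism that reintroduces reward variance within a group (the stated goal of the exploratory rollouts in the abstract) would restore a non-zero signal.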

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

有害合规的不同路径：大型语言模型越狱的行为副作用与机制性分歧

Authors: Md Rysul Kabir, Zoran Tiganj
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.18510
Pdf link: https://arxiv.org/pdf/2604.18510
Abstract Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models, but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce vastly different properties despite similar harmfulness, with models jailbroken via RLVR showing remarkable similarity to the base model.
中文摘要 开放权重语言模型可以通过多种不同的干预措施变得不安全，但由此得到的模型在能力、行为特征和内部失效模式上可能存在显著差异。我们研究越狱模型的行为和机制特性，涵盖三种不安全路径：有害监督微调（SFT）、带可验证奖励的有害强化学习（RLVR）和抑制拒绝的abliteration（消融）。这三条路径都达到了接近上限的有害顺从，但一旦超出直接有害性的范围，它们就出现分化。RLVR越狱模型退化最小，并在结构化自我审计中保留了明确的危害识别能力：它们能够识别有害提示并描述安全的大型语言模型应如何响应，却仍然执行有害请求。对于RLVR，有害行为可被反思性安全支架强力抑制：当在有害提示前加上一条反思安全标准的指令时，有害行为会降至接近基线的水平。针对特定类别的RLVR越狱可广泛泛化到各个有害性领域。采用SFT越狱的模型在显式安全判断上崩溃最严重，行为漂移最大，且在标准基准测试上能力损失显著。abliteration在自我审计和对反思性安全支架的响应上都依赖于模型家族。机制分析和修复分析进一步区分了这些路径：abliteration与局部拒绝特征的删除相符，RLVR表现为安全几何得以保留但策略行为被重新定向，SFT则与更广泛的分布式漂移相符。有针对性的修复可部分恢复RLVR越狱的模型，但对SFT越狱的模型影响甚微。综合来看，这些结果表明，尽管有害性相近，越狱所产生的模型特性却可能大不相同，其中通过RLVR越狱的模型与基础模型表现出显著的相似性。

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

UDM-GRPO：面向均匀离散扩散模型的稳定高效组相对策略优化

Authors: Jiaqi Wang (1 and 2), Haoge Deng (2), Ting Pan (2), Yang Liu (2), Chengyuan Wang (2), Fan Zhang (2), Yonggang Qi (1), Xinlong Wang (2) ((1) Beijing University of Posts and Telecommunications, (2) Beijing Academy of Artificial Intelligence)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.18518
Pdf link: https://arxiv.org/pdf/2604.18518
Abstract Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at this https URL.
中文摘要 均匀离散扩散模型（UDM）最近成为离散生成建模的一种有前景的范式；然而，其与强化学习的结合在很大程度上仍未被探索。我们观察到，直接将GRPO应用于UDM会导致训练不稳定，且性能提升甚微。为此，我们提出了UDM-GRPO，这是首个将UDM与强化学习相结合的框架。我们的方法基于两个关键见解：（i）将最终的干净样本视为动作，可提供更准确、更稳定的优化信号；（ii）通过扩散前向过程重建轨迹，使概率路径更好地与预训练分布对齐。此外，我们还引入了两种策略：Reduced-Step和CFG-Free，以进一步提升训练效率。UDM-GRPO显著提升了基础模型在多个T2I任务上的性能。值得注意的是，GenEval准确率从69%提升到96%，PickScore从20.46提升到23.81，在连续和离散设置下均达到了最先进的性能。在OCR基准测试上，准确率从8%提升至57%，进一步验证了我们方法的泛化能力。代码可在 this https URL 获取。

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

OGER：用于混合强化学习的鲁棒离线引导探索奖励

Authors: Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18530
Pdf link: https://arxiv.org/pdf/2604.18530
Abstract Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at this https URL.
中文摘要 可验证奖励强化学习（RLVR）的最新进展显著提升了大型语言模型（LLM）的推理能力，但模型常常难以探索超出其初始潜在空间的新轨迹。虽然已有离线教师指导和熵驱动策略被提出来解决这个问题，但它们往往缺乏深度整合，或受限于模型的固有能力。本文提出了OGER，一个新框架，通过专门的奖励建模视角统一离线教师指导与在线强化学习。OGER采用多教师协同训练，并构建辅助探索奖励，利用离线轨迹和模型自身的熵来激励自主探索。在数学和通用推理基准上的大量实验表明，OGER显著优于有竞争力的基线，在数学推理方面取得了大幅提升，同时保持了对域外任务的稳健泛化能力。我们对训练动态进行了全面分析，并进行了详细的消融研究，以验证熵感知奖励调节的有效性。我们的代码可在 this https URL 获取。

When Can LLMs Learn to Reason with Weak Supervision?

大型语言模型（LLM）什么时候能学会在弱监督下进行推理？

Authors: Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18574
Pdf link: https://arxiv.org/pdf/2604.18574
Abstract Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
中文摘要 大型语言模型通过带可验证奖励的强化学习（RLVR）实现了显著的推理能力提升。然而，随着模型能力的提升，构建高质量的奖励信号变得越来越困难，因此理解RLVR在较弱监督形式下何时能够成功变得至关重要。我们在三种弱监督设置（稀缺数据、噪声奖励和自监督代理奖励）下，对不同模型家族和推理领域进行了系统的实证研究。我们发现泛化受训练奖励饱和动态支配：能够泛化的模型表现出较长的饱和前阶段，在此期间训练奖励和下游性能同步攀升，而迅速饱和的模型则是在记忆而非学习。我们发现推理忠实度（定义为中间步骤在逻辑上支持最终答案的程度）是能够预测模型落入哪种状态的强化学习前属性，而仅凭输出多样性则不具参考价值。基于这些发现，我们厘清了持续预训练和监督微调各自的贡献，发现在弱监督下，基于显式推理轨迹的SFT是泛化的必要条件，而在领域数据上的持续预训练则会放大这一效应。将这些干预措施一并应用于Llama3.2-3B-Base后，模型在此前基础模型均告失败的全部三种设置下都实现了泛化。

Bounded Ratio Reinforcement Learning

有界比率强化学习

Authors: Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18578
Pdf link: https://arxiv.org/pdf/2604.18578
Abstract Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
中文摘要 由于其可扩展性和跨领域的经验鲁棒性，近端策略优化（PPO）已成为同策略（on-policy）强化学习的主流算法。然而，信任域方法的理论基础与PPO中使用的启发式裁剪目标之间存在显著脱节。本文通过引入有界比率强化学习（BRRL）框架来弥合这一差距。我们提出了一个新的带正则化和约束的策略优化问题，并推导出其解析最优解。我们证明该解能确保性能单调提升。为处理参数化策略类，我们开发了一种名为有界策略优化（BPO）的策略优化算法，它最小化策略与BRRL解析最优解之间的优势加权散度。我们还进一步以BPO损失函数给出了所得策略期望性能的下界。值得注意的是，我们的框架还提供了新的理论视角来解释PPO损失的成功，并将信任域策略优化与交叉熵方法（CEM）联系起来。我们还将BPO扩展为组相对BPO（GBPO），用于LLM微调。BPO在MuJoCo、Atari及复杂IsaacLab环境（如人形机器人运动）上的实证评估，以及GBPO在LLM微调任务上的评估表明，BPO和GBPO在稳定性和最终性能方面通常能够匹敌甚至超过PPO和GRPO。
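
For reference, the "heuristic clipped objective" that the abstract above contrasts with trust region methods is usually written as follows. This is the standard PPO surrogate, not the BRRL or BPO objective, which the abstract does not specify. The symbols ($r_t$ for the probability ratio, $\hat{A}_t$ for the advantage estimate, $\epsilon$ for the clip range) follow common usage and are not taken from this paper.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) =
\mathbb{E}_t\!\left[
  \min\!\Big(
    r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
  \Big)
\right]
```

The clip keeps the ratio $r_t(\theta)$ from being rewarded for moving outside $[1-\epsilon,\, 1+\epsilon]$, which approximates a trust region without enforcing one explicitly.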

Keyword: diffusion policy

No results.