Arxiv Papers of Today

Generated: 2026-04-03 16:58:40 (UTC+8); arXiv release time: 2026-04-03 20:00 EDT (2026-04-04 08:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

Trustworthy AI-Driven Dynamic Hybrid RIS: Joint Optimization and Reward Poisoning-Resilient Control in Cognitive MISO Networks

Authors: Deemah H. Tashman, Soumaya Cherkaoui
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01238
Pdf link: https://arxiv.org/pdf/2604.01238
Abstract Cognitive radio networks (CRNs) are a key mechanism for alleviating spectrum scarcity by enabling secondary users (SUs) to opportunistically access licensed frequency bands without harmful interference to primary users (PUs). To address unreliable direct SU links and energy constraints common in next-generation wireless networks, this work introduces an adaptive, energy-aware hybrid reconfigurable intelligent surface (RIS) for underlay multiple-input single-output (MISO) CRNs. Distinct from prior approaches relying on static RIS architectures, our proposed RIS dynamically alternates between passive and active operation modes in real time according to harvested energy availability. We also model our scenario under practical hardware impairments and cascaded fading channels. We formulate and solve a joint transmit beamforming and RIS phase optimization problem via the soft actor-critic (SAC) deep reinforcement learning (DRL) method, leveraging its robustness in continuous and highly dynamic environments. Notably, we conduct the first systematic study of reward poisoning attacks on DRL agents in RIS-enhanced CRNs, and propose a lightweight, real-time defense based on reward clipping and statistical anomaly filtering. Numerical results demonstrate that the SAC-based approach consistently outperforms established DRL baselines, and that the dynamic hybrid RIS strikes a superior trade-off between throughput and energy consumption compared to fully passive and fully active alternatives. We further show the effectiveness of our defense in maintaining SU performance even under adversarial conditions. Our results advance the practical and secure deployment of RIS-assisted CRNs, and highlight crucial design insights for energy-constrained wireless systems.
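The defense named in the abstract (reward clipping plus statistical anomaly filtering) can be illustrated with a minimal sketch. The class name, the z-score test, the sliding window, and every default value below are assumptions made for illustration; the abstract does not specify the authors' actual filter.

```python
from collections import deque
from statistics import mean, pstdev


class RewardSanitizer:
    """Clip rewards, then replace statistical outliers (hypothetical helper)."""

    def __init__(self, low=-1.0, high=1.0, window=100, z_max=3.0, min_history=10):
        self.low, self.high = low, high
        self.z_max = z_max
        self.min_history = min_history
        self.history = deque(maxlen=window)  # recently accepted rewards

    def __call__(self, reward):
        # Step 1: clip to a trusted range so a poisoned value has bounded effect.
        clipped = max(self.low, min(self.high, reward))
        # Step 2: z-score test against the recently accepted rewards.
        if len(self.history) >= self.min_history:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0 and abs(clipped - mu) / sigma > self.z_max:
                return mu  # anomalous: fall back to the running mean
        self.history.append(clipped)
        return clipped
```

The sanitizer would sit between the environment and the SAC replay buffer, so that rejected rewards never enter the statistics used to judge later ones.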

Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming

Authors: Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding, Ran Xin, Xia Xiao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.01302
Pdf link: https://arxiv.org/pdf/2604.01302
Abstract We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.

Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning

Authors: Vikram Krishnamurthy, Luke Snow
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.01345
Pdf link: https://arxiv.org/pdf/2604.01345
Abstract Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses. Adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.

RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics

Authors: Zhengyang Qi, Charles Dickens, Derek Pham, Amanda Dsouza, Armin Parchami, Frederic Sala, Paroma Varma
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01375
Pdf link: https://arxiv.org/pdf/2604.01375
Abstract Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose rubric quality issues from such aggregated or downstream signals alone. To address this gap, we introduce RIFT: RubrIc Failure mode Taxonomy, a taxonomy for systematically characterizing failure modes in rubric composition and design. RIFT consists of eight failure modes organized into three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. RIFT is developed using grounded theory by iteratively annotating rubrics drawn from five diverse benchmarks spanning general instruction following, code generation, creative writing, and expert-level deep research, until no new failure modes are identified. We evaluate the consistency of the taxonomy by measuring agreement among independent human annotators, observing fair agreement overall (87% pairwise agreement and 0.64 average Cohen's kappa). Finally, to support scalable diagnosis, we propose automated rubric quality metrics and show that they align with human failure-mode annotations, achieving up to 0.86 F1.
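For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Cohen's kappa for two annotators. The function name and the handling of the degenerate case are choices made for this illustration, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention chosen here: both used one identical label
    return (observed - expected) / (1.0 - expected)
```

Pairwise raw agreement (the 87% figure) corresponds to `observed`; kappa discounts the share of that agreement expected by chance.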

Residuals-based Offline Reinforcement Learning

Authors: Qing Zhu, Xian Yu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2604.01378
Pdf link: https://arxiv.org/pdf/2604.01378
Abstract Offline reinforcement learning (RL) has received increasing attention for learning policies from previously collected data without interaction with the real environment, which is particularly important in high-stakes applications. While a growing body of work has developed offline RL algorithms, these methods often rely on restrictive assumptions about data coverage and suffer from distribution shift. In this paper, we propose a residuals-based offline RL framework for general state and action spaces. Specifically, we define a residuals-based Bellman optimality operator that explicitly incorporates estimation error in learning transition dynamics into policy optimization by leveraging empirical residuals. We show that this Bellman operator is a contraction mapping and identify conditions under which its fixed point is asymptotically optimal and possesses finite-sample guarantees. We further develop a residuals-based offline deep Q-learning (DQN) algorithm. Using a stochastic CartPole environment, we demonstrate the effectiveness of our residuals-based offline DQN algorithm.

Improving Latent Generalization Using Test-time Compute

Authors: Arslan Chaudhry, Sridhar Thiagarajan, Andrew Lampinen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.01430
Pdf link: https://arxiv.org/pdf/2604.01430
Abstract Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to facilitate deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization capabilities. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or 'thinking', specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) to improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failures on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge for which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion, but the generate-and-verify ability of thinking models enables them to get well above chance performance. The brittleness of factual self-verification means thinking models still remain well below the performance of in-context learning for this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.

Reinforcing Consistency in Video MLLMs with Structured Rewards

Authors: Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.01460
Pdf link: https://arxiv.org/pdf/2604.01460
Abstract Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.

When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals

Authors: Rui Wu, Ruixiang Tang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.01476
Pdf link: https://arxiv.org/pdf/2604.01476
Abstract Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and find that the shortcut direction tracks hacking behavior most closely, making it an effective representational proxy for detection. Motivated by this finding, we propose Advantage Modification, which integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts before policy updates. Because the penalty is internalized into the training signal rather than applied only at inference time, Advantage Modification provides more robust suppression of hacking compared with generation-time activation steering.
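One way to picture folding a shortcut score into GRPO advantages is sketched below. The group-normalized advantage is the standard GRPO form; the subtractive hinge penalty, the `penalty` and `threshold` parameters, and both function names are assumptions for illustration, since the abstract does not give the exact formula used by Advantage Modification.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def modified_advantages(rewards, shortcut_scores, penalty=1.0, threshold=0.0):
    """Lower the advantage of rollouts whose shortcut score exceeds a threshold.

    shortcut_scores: one scalar per rollout, e.g. a projection of hidden
    states onto a shortcut concept direction (assumed to be precomputed).
    """
    base = grpo_advantages(rewards)
    return [a - penalty * max(0.0, s - threshold)
            for a, s in zip(base, shortcut_scores)]
```

Because the penalty is applied before the policy update, a rollout that earned its reward by hacking contributes a smaller (or negative) gradient signal than an equally rewarded legitimate rollout.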

Soft MPCritic: Amortized Model Predictive Value Iteration

Authors: Thomas Banker, Nathan P. Lawrence, Ali Mesbah
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.01477
Pdf link: https://arxiv.org/pdf/2604.01477
Abstract Reinforcement learning (RL) and model predictive control (MPC) offer complementary strengths, yet combining them at scale remains computationally challenging. We propose soft MPCritic, an RL-MPC framework that learns in (soft) value space while using sample-based planning for both online control and value target generation. soft MPCritic instantiates MPC through model predictive path integral control (MPPI) and trains a terminal Q-function with fitted value iteration, aligning the learned value function with the planner and implicitly extending the effective planning horizon. We introduce an amortized warm-start strategy that recycles planned open-loop action sequences from online observations when computing batched MPPI-based value targets. This makes soft MPCritic computationally practical, while preserving solution quality. soft MPCritic plans in a scenario-based fashion with an ensemble of dynamic models trained for next-step prediction accuracy. Together, these ingredients enable soft MPCritic to learn effectively through robust, short-horizon planning on classic and complex control tasks. These results establish soft MPCritic as a practical and scalable blueprint for synthesizing MPC policies in settings where policy extraction and direct, long-horizon planning may fail.
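The planning step described above can be sketched as a single MPPI update with a learned terminal term. This is a generic, scalar-state illustration under stated assumptions: the function signature, the use of a state-value function in place of the paper's terminal Q-function, and all default parameters are hypothetical. Passing the previous plan as `nominal` corresponds loosely to the warm-start idea.

```python
import math
import random


def mppi_plan(state, nominal, dynamics, stage_cost, terminal_value,
              num_samples=256, noise_std=0.5, temperature=1.0, rng=None):
    """One MPPI update of a nominal open-loop action sequence (scalar actions)."""
    rng = rng or random.Random(0)
    horizon = len(nominal)
    noises, costs = [], []
    for _ in range(num_samples):
        # Perturb the nominal sequence and roll it out through the model.
        eps = [rng.gauss(0.0, noise_std) for _ in range(horizon)]
        x, cost = state, 0.0
        for u, e in zip(nominal, eps):
            a = u + e
            cost += stage_cost(x, a)
            x = dynamics(x, a)
        # The learned value stands in for everything beyond the short horizon.
        cost -= terminal_value(x)
        noises.append(eps)
        costs.append(cost)
    # Exponentiated-cost weights; subtracting the best cost avoids underflow.
    best = min(costs)
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    # New plan = nominal plus the weighted average perturbation at each step.
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(nominal)]
```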

DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data

Authors: Arshia Ilaty, Hossein Shirazi, Amir Rahmani, Hajar Homayouni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01481
Pdf link: https://arxiv.org/pdf/2604.01481
Abstract The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often struggle to capture the complex, non-linear dependencies and severe class imbalances inherent in Electronic Health Records (EHR), leading to statistically plausible but clinically invalid records. To bridge this gap, we introduce DISCO-TAB (DIScriminator-guided COntrol for TABular synthesis), a novel framework that orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning. Unlike prior methods relying on scalar feedback, DISCO-TAB evaluates synthesis at four granularities, token, sentence, feature, and row, while integrating Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and resolve minority-class collapse. We rigorously validate our framework across diverse benchmarks, including high-dimensional, small-sample medical datasets (e.g., Heart Failure, Parkinson's). Our results demonstrate that hierarchical feedback yields state-of-the-art performance, achieving up to 38.2% improvement in downstream clinical classifier utility compared to GAN and Diffusion baselines, while ensuring exceptional statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks. This work establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications.
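The fidelity figure above is a Jensen-Shannon divergence (JSD) between real and synthetic feature distributions. A minimal sketch for two discrete distributions follows; the base-2 logarithm (which bounds JSD to [0, 1]) is an assumption, as the abstract does not state which base the authors use.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

In practice `p` and `q` would be normalized histograms of one column in the real and synthetic tables, computed over the same bins or categories.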

Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training

Authors: William Hoy, Binxu Wang, Xu Pan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.01499
Pdf link: https://arxiv.org/pdf/2604.01499
Abstract Evolution Strategies (ES) have emerged as a scalable gradient-free alternative to reinforcement learning based LLM fine-tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space. We compare ES and Group Relative Policy Optimization (GRPO) across four tasks in both single-task and sequential continual-learning settings. ES matches or exceeds GRPO in single-task accuracy and remains competitive sequentially when its iteration budget is controlled. Despite this similarity in task performance, the two methods produce markedly different model updates: ES makes much larger changes and induces broader off-task KL drift, whereas GRPO makes smaller, more localized updates. Strikingly, the ES and GRPO solutions are linearly connected with no loss barrier, even though their update directions are nearly orthogonal. We develop an analytical theory of ES that explains all these phenomena within a unified framework, showing how ES can accumulate large off-task movement on weakly informative directions while still making enough progress on the task to match gradient-based RL in downstream accuracy. These results show that gradient-free and gradient-based fine-tuning can reach similarly accurate yet geometrically distinct solutions, with important consequences for forgetting and knowledge preservation. The source code is publicly available: this https URL.

DeltaMem: Towards Agentic Memory Management via Reinforcement Learning

Authors: Qi Zhang, Shen Huang, Chu Liu, Shouqing Yang, Junbo Zhao, Haobo Wang, Pengjun Xie
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.01560
Pdf link: https://arxiv.org/pdf/2604.01560
Abstract Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.
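To make the idea of an edit-distance reward concrete, the sketch below applies the classic Levenshtein distance to sequences of memory operations and maps it to a reward in [0, 1]. The operation encoding and the normalization are assumptions for illustration; the paper's Memory-based Levenshtein Distance may be defined differently.

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def memory_update_reward(predicted_ops, reference_ops):
    """Hypothetical reward: 1 minus the length-normalized edit distance."""
    longest = max(len(predicted_ops), len(reference_ops))
    if longest == 0:
        return 1.0  # nothing to update and nothing predicted
    return 1.0 - levenshtein(predicted_ops, reference_ops) / longest
```

For example, if the model predicts `[("ADD", "likes tea"), ("DELETE", "lives in Paris")]` while the label is `[("ADD", "likes tea"), ("UPDATE", "lives in Berlin")]`, one substitution is needed and the reward is 0.5.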

Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

Authors: Shota Takashiro, Masanori Koyama, Takeru Miyato, Yusuke Iwasawa, Yutaka Matsuo, Kohei Hayashi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01577
Pdf link: https://arxiv.org/pdf/2604.01577
Abstract We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable internal structures that evolve alongside the input. This mechanism allows the model to maintain coherent and clustered representations over long horizons, improving out-of-distribution generalization in reinforcement learning and algorithmic tasks compared to sequential baselines such as LSTM, state space models, and Transformer variants.

MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

Authors: Zitian Tang, Xu Zhang, Jianbo Yuan, Yang Zou, Varad Gunjal, Songyao Jiang, Davide Modolo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01600
Pdf link: https://arxiv.org/pdf/2604.01600
Abstract Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not expose the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model's self-correction ability via rolling out a shared first turn, while the second stage improves the coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through the interaction with the environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.

Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error

Authors: Taisuke Kobayashi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.01613
Pdf link: https://arxiv.org/pdf/2604.01613
Abstract In reinforcement learning (RL), temporal difference (TD) errors are widely adopted for optimizing value and policy functions. However, since the TD error is defined by a bootstrap method, its computation tends to be noisy and to destabilize learning. Heuristics that improve the accuracy of TD errors, such as target networks and ensemble models, have been introduced. While these are essential to current deep RL algorithms, they cause side effects such as increased computational cost and reduced learning efficiency. Therefore, this paper revisits the TD learning algorithm based on control as inference, deriving a novel algorithm capable of robust learning against noisy TD errors. First, the distribution model of optimality, a binary random variable, is represented by a sigmoid function. Combined with forward and reverse Kullback-Leibler divergences, this new model yields a robust learning rule: when the sigmoid function saturates with a large TD error, probably due to noise, the gradient vanishes, implicitly excluding that sample from learning. Furthermore, the two divergences exhibit distinct gradient-vanishing characteristics. Building on these analyses, the optimality is decomposed into multiple levels to achieve pseudo-quantization of TD errors, aiming for further noise reduction. Additionally, a Jensen-Shannon divergence-based approach is approximately derived to inherit the characteristics of both divergences. These benefits are verified through RL benchmarks, demonstrating stable learning even when heuristics are insufficient or rewards contain noise.
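The gradient-vanishing effect mentioned above follows from the sigmoid's derivative, s(x)(1 - s(x)), which peaks at x = 0 and decays toward zero as |x| grows. The sketch below shows only that property; the function names and the `scale` parameter are hypothetical, and this is not the paper's learning rule.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def saturating_weight(td_error, scale=1.0):
    """Derivative of sigmoid(td_error / scale) with respect to its argument.

    Large-magnitude TD errors (possibly noise) receive a weight near zero,
    so they barely influence a gradient that is multiplied by this factor.
    """
    s = sigmoid(td_error / scale)
    return s * (1.0 - s)
```

A TD error of 0 gets the maximum weight of 0.25, while a TD error 20 scales away from zero gets a weight below 1e-8.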

ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents

Authors: Yong Wu, YanZhao Zheng, TianZe Xu, ZhenTao Zhang, YuanQiang Yu, JiHuai Zhu, Chao Ma, BinBin Lin, BaoHua Dong, HangCheng Zhu, RuoHui Huang, Gang Yu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01664
Pdf link: https://arxiv.org/pdf/2604.01664
Abstract LLM-based agents show strong potential for long-horizon reasoning, yet their context size is limited by deployment factors (e.g., memory, latency, and cost), yielding a constrained context budget. As interaction histories grow, this induces a trade-off between retaining past information and staying within the context limit. To address this challenge, we propose Budget-Aware Context Management (BACM), which formulates context management as a sequential decision problem with a context budget constraint. It enables agents to assess the available budget before incorporating new observations and decide when and how much of the interaction history to compress. We further develop BACM-RL, an end-to-end curriculum-based reinforcement learning approach that learns compression strategies under varying context budgets. Experiments on compositional multi-objective QA and long-horizon web browsing benchmarks show that BACM-RL consistently outperforms prior methods across model scales and task complexities, achieving over $1.6\times$ gains over strong baselines in high-complexity settings, while maintaining strong advantages as budgets shrink, where most methods exhibit a downward performance trend.
中文摘要 基于LLM的代理在长期推理方面展现出强大潜力，但其上下文规模受部署因素（如内存、延迟和成本）限制，导致上下文预算受限。随着交互历史的增长，这导致在保留过去信息和不超出上下文限制之间产生权衡。为应对这一挑战，我们提出了预算感知上下文管理（BACM），该方法将上下文管理表述为带有上下文预算约束的顺序决策问题。它使代理能够在纳入新观察数据前评估可用预算，并决定何时以及压缩多少交互历史。我们还进一步开发了BACM-RL，这是一种端到端的基于课程的强化学习方法，在不同上下文预算下学习压缩策略。在组合式多目标问答和长期网页浏览基准测试上的实验显示，BACM-RL在不同模型规模和任务复杂度下持续优于以往方法，在高复杂度环境中相比强基线提升超过$1.6\times$，并且在预算缩减时仍保持显著优势，而大多数方法在此情况下表现呈下降趋势。
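Illustrative sketch (not from the paper): the "check the budget before adding an observation, then decide how much history to compress" loop can be written as a small rule-based function. BACM-RL learns this decision with reinforcement learning; here the whitespace tokenizer, the truncating `truncate_summary` compressor, and the oldest-first rule are all stand-in assumptions.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


def truncate_summary(text: str, keep: int = 5) -> str:
    """Placeholder compressor: keep the first `keep` words."""
    return " ".join(text.split()[:keep])


def add_observation(history: List[str], observation: str, budget: int,
                    compress: Callable[[str], str] = truncate_summary) -> List[str]:
    """Append `observation`, compressing the oldest entries only while needed.

    The newest observation is never compressed. If compressing every older
    entry is still not enough, the result may exceed the budget.
    """
    new_history = list(history) + [observation]
    for i in range(len(new_history) - 1):
        if sum(count_tokens(h) for h in new_history) <= budget:
            break
        new_history[i] = compress(new_history[i])
    return new_history
```

A learned policy would replace the fixed oldest-first rule with a choice of when and how aggressively to compress.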

DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

DEFT：分布引导高效微调以实现人类对齐

Authors: Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan, Min Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.01787
Pdf link: https://arxiv.org/pdf/2604.01787
Abstract Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
中文摘要 基于人类反馈的强化学习（RLHF）使用近端策略优化（PPO）等算法，使大型语言模型（LLMs）与人类价值观对齐，但成本高昂且不稳定。已有替代方案被提出，以取代PPO，或整合监督微调（SFT）和对比学习，实现直接微调和价值对齐。然而，这些方法仍需大量数据来学习偏好，可能削弱LLMs的泛化能力。为了进一步提升比对效率和性能，同时减少泛化能力的丧失，本文介绍了分布引导高效微调（DEFT），这是一种高效的比对框架，结合数据过滤和分布指导，通过基于语言模型输出分布和偏好数据差异分布计算差分分布奖励。从原始数据中筛选出一个小而高质量的子集，利用差分分布奖励，然后将其纳入现有比对方法中，以指导模型的输出分布。实验结果表明，由DEFT增强的方法在比对能力和泛化能力上均优于原始方法，训练时间显著缩短。

TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning

TestDecision：通过贪婪优化与强化学习生成顺序测试套件

Authors: Guoqing Wang, Chengran Yang, Xiaoxuan Zhou, Zeyu Sun, Bo Wang, David Lo, Dan Hao
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.01799
Pdf link: https://arxiv.org/pdf/2604.01799
Abstract With the rapid evolution of LLMs, automated software testing is witnessing a paradigm shift. While proprietary models like GPT-4o demonstrate impressive capabilities, their high deployment costs and data privacy concerns make open-source LLMs the practical imperative for many academic and industrial scenarios. Automated test generation has evolved toward iterative workflows that construct test suites with LLMs. When using open-source LLMs, we empirically observe that they lack a suite-level perspective and suffer from structural myopia: they fail to generate new tests with large marginal gain given the current coverage status. In this paper, taking a sequential perspective, we formalize test suite generation as an MDP and demonstrate that its objective exhibits monotone submodularity, which enables an effective relaxation of this NP-hard global optimization into a tractable step-wise greedy procedure. Guided by this insight, we propose TestDecision, which transforms LLMs into neural greedy experts. TestDecision consists of two synergistic components: (1) an inference framework that implements test suite construction following a step-wise greedy strategy; and (2) a reinforcement learning training pipeline that equips the base LLM with sequential test generation ability to maximize marginal gain. Comprehensive evaluations on the ULT benchmark demonstrate that TestDecision significantly outperforms existing advanced methods. It improves branch coverage by 38.15-52.37% and execution pass rate by 298.22-558.88% over all base models, achieving performance on a 7B backbone comparable to that of the much larger proprietary LLM GPT-5.2. Furthermore, TestDecision finds 58.43-95.45% more bugs than vanilla base LLMs and exhibits superior generalization on LiveCodeBench, demonstrating its capability to construct high-quality test suites.
中文摘要 随着大型语言模型（LLM）的快速发展，自动化软件测试正经历范式转变。虽然像GPT-4o这样的专有模型展现了令人印象深刻的能力，但其高昂的部署成本和数据隐私问题使开源LLM成为许多学术和工业场景的实际必需。在自动化测试生成领域，已发展为基于大型语言模型构建测试套件的迭代工作流。在使用开源大型语言模型时，我们实证观察到它们缺乏套件层面的视角，存在结构性近视——无法生成基于当前覆盖状态的巨大边际增益的新测试。本文从序列的角度，我们将测试套件生成形式化为MDP，并证明其目标具有单调子模性，使得将NP硬全局优化有效松弛为可处理的分步贪婪过程。基于这一见解，我们提出了TestDecision，将LLM转变为神经贪婪专家。TestDecision 由两个协同组件组成：（1）一个推理框架，采用逐步贪婪策略实现测试套件构建;以及（2）强化学习的训练流水线，使基础LLM具备顺序测试生成能力，以最大化边际收益。对ULT基准的全面评估显示，TestDecision远远优于现有的先进方法。它在所有基础模型中，分支覆盖率提升了38.15%至52.37%，执行通过率提升至298.22%至558.88%，在7B骨干网与更大专有LLM GPT-5.2上实现了相当的性能。此外，TestDecision比原版大型语言模型多发现58.43%至95.45%的错误，并在LiveCodeBench上展现出更优的泛化能力，证明了其构建高质量测试套件的能力。
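Illustrative sketch (not from the paper): the step-wise greedy procedure that monotone submodularity justifies is the classic greedy rule for coverage, which repeatedly picks the test with the largest marginal gain. The candidate tests and branch IDs are made up, and in TestDecision the candidates would come from an LLM rather than a fixed pool.

```python
from typing import Dict, List, Set, Tuple


def greedy_test_selection(candidates: Dict[str, Set[int]],
                          max_tests: int) -> Tuple[List[str], Set[int]]:
    """Pick tests one at a time by marginal branch-coverage gain.

    `candidates` maps a (hypothetical) test name to the set of branch IDs it
    covers. Ties are broken alphabetically; selection stops early once no
    remaining test adds new coverage.
    """
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_tests:
        best = max(sorted(remaining),
                   key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # every remaining test has zero marginal gain
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered
```

A test that only re-covers branches already hit has zero marginal gain and is skipped, which is the suite-level view the abstract says vanilla models lack.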

STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

STRIVE：视频问答中强化学习的结构化时空探索

Authors: Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.01824
Pdf link: https://arxiv.org/pdf/2604.01824
Abstract We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
中文摘要 我们介绍了STRIVE（具有重要性感知变异探索的时空强化），这是一个用于视频问答的结构化强化学习框架。虽然基于群体的策略优化方法在大型多模态模型中展现出潜力，但当响应表现出相似的正确性时，通常会存在较低的奖励方差，导致优势估计较弱或不稳定。STRIVE 通过构建每个输入视频的多个时空变体，并在文本生成和视觉变体之间进行联合归一化，解决了这一限制。通过将群体比较扩展到结构化的视觉扰动，STRIVE丰富了奖励信号，促进了更稳定和富有信息性的政策更新。为确保探索保持语义基础，我们引入了一种重要性感知抽样机制，优先考虑与输入问题最相关的框架，同时保持时间覆盖。这种设计鼓励在互补的视觉视角间进行扎实的推理，而不是过度拟合于单一的时空配置。在六个具有挑战性的视频推理基准测试（包括VideoMME、TempCompass、VideoMMMU、MMVU、VSI-Bench和PerceptionTest）上的实验显示，在多个大型多模态模型中，相较于强强化学习基线，取得了持续的提升。我们的结果强调了结构化时空探索作为稳定多模态强化学习和提升视频推理性能的原则机制的作用。
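Illustrative sketch (not from the paper): the low-variance problem and the effect of pooling across visual variants can be seen with plain mean/std normalization. The `rewards[v][g]` layout (variant `v`, generation `g`) and the exact normalization are assumptions; STRIVE's actual formulation may differ.

```python
from statistics import mean, pstdev
from typing import List


def normalize(values: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize values to zero mean and (roughly) unit variance."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def per_variant_advantages(rewards: List[List[float]]) -> List[List[float]]:
    """Normalize each variant's group of generations separately."""
    return [normalize(row) for row in rewards]


def joint_advantages(rewards: List[List[float]]) -> List[List[float]]:
    """Normalize over all generations of all variants at once.

    Assumes every variant has the same number of generations.
    """
    flat = [r for row in rewards for r in row]
    adv = normalize(flat)
    n = len(rewards[0])
    return [adv[i * n:(i + 1) * n] for i in range(len(rewards))]
```

If one variant's generations are all correct, its per-variant advantages are all zero; pooling with a harder variant gives those same generations a non-zero signal.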

Physics Informed Reinforcement Learning with Gibbs Priors for Topology Control in Power Grids

基于Gibbs先验的物理知情强化学习用于电网拓扑控制

Authors: Pantelis Dogoulis, Maxime Cordy
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.01830
Pdf link: https://arxiv.org/pdf/2604.01830
Abstract Topology control for power grid operation is a challenging sequential decision-making problem because the action space grows combinatorially with the size of the grid and action evaluation through simulation is computationally expensive. We propose a physics-informed reinforcement learning framework that combines semi-Markov control with a Gibbs prior over the action space that encodes the system's physics. A decision is taken only when the grid enters a hazardous regime, while a graph neural network surrogate predicts the post-action overload risk of feasible topology actions. These predictions are used to construct a physics-informed Gibbs prior that both selects a small state-dependent candidate set and reweights policy logits before action selection. In this way, our method reduces exploration difficulty and online simulation cost while preserving the flexibility of a learned policy. We evaluate the approach in three realistic benchmark environments of increasing difficulty. Across all settings, the proposed method achieves a strong balance between control quality and computational efficiency: it matches oracle-level performance while being approximately $6\times$ faster on the first benchmark, reaches $94.6\%$ of oracle reward with roughly $200\times$ lower decision time on the second, and on the most challenging benchmark improves over a PPO baseline by up to $255\%$ in reward and $284\%$ in survived steps while remaining about $2.5\times$ faster than a strong specialized engineering baseline. These results show that our method provides an effective mechanism for topology control in power grids.
中文摘要 电网运行中的拓扑控制是一个具有挑战性的顺序决策问题，因为动作空间随电网规模呈组合式增长，而通过仿真进行动作评估的计算成本高昂。我们提出了一个物理知情的强化学习框架，将半马尔可夫控制与动作空间上编码系统物理特性的吉布斯先验相结合。只有当电网进入危险状态时才会做出决策，同时由图神经网络代理模型预测可行拓扑动作执行后的过载风险。这些预测用于构建一个物理知情的吉布斯先验，既用于选出一个依赖于状态的小型候选集，又在动作选择前对策略logits重新加权。通过这种方式，我们的方法降低了探索难度和在线仿真成本，同时保留了学习策略的灵活性。我们在三个难度递增的现实基准环境中评估该方法。在所有设置下，该方法在控制质量与计算效率之间取得了良好平衡：在第一个基准上达到oracle级性能且速度约快$6\times$；在第二个基准上以约低$200\times$的决策时间达到oracle奖励的$94.6\%$；在最具挑战性的基准上，相比PPO基线奖励最多提升$255\%$、存活步数最多提升$284\%$，同时仍比一个强大的专用工程基线快约$2.5\times$。这些结果表明，我们的方法为电网中的拓扑控制提供了有效的机制。
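Illustrative sketch (not from the paper): a Gibbs prior of the form p(a) ∝ exp(-risk(a)/T) can serve both roles named in the abstract, pruning to a small candidate set and reweighting policy logits. The risk values, the temperature, the candidate-set size `k`, and the additive log-prior combination are assumptions; in the paper the risks come from a graph neural network surrogate.

```python
import math
from typing import Dict, List, Tuple


def gibbs_log_prior(risks: Dict[str, float], temperature: float = 0.1) -> Dict[str, float]:
    """Log of p(a) proportional to exp(-risk(a) / temperature), in log space."""
    m = min(risks.values())
    unnorm = {a: -(r - m) / temperature for a, r in risks.items()}
    log_z = math.log(sum(math.exp(u) for u in unnorm.values()))
    return {a: u - log_z for a, u in unnorm.items()}


def select_action(policy_logits: Dict[str, float], risks: Dict[str, float],
                  k: int = 2, temperature: float = 0.1) -> Tuple[str, List[str]]:
    """Keep the k most probable actions under the prior, then reweight logits."""
    log_prior = gibbs_log_prior(risks, temperature)
    candidates = sorted(log_prior, key=lambda a: (-log_prior[a], a))[:k]
    scores = {a: policy_logits[a] + log_prior[a] for a in candidates}
    return max(candidates, key=lambda a: scores[a]), candidates
```

An action with a high policy logit but a high predicted overload risk is dropped before the policy ever scores it, which is how the prior cuts exploration and simulation cost in this toy version.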

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

并非所有词元的视野都相同：面向大型视觉语言模型的基于感知的策略优化

Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01840
Pdf link: https://arxiv.org/pdf/2604.01840
Abstract While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on this https URL.
中文摘要 虽然可验证奖励强化学习（RLVR）在大型视觉语言模型（LVLMs）中推动了推理，但现有框架存在一个根本方法论缺陷：通过在所有生成的词元上分配相同的优势，这些方法本质上稀释了优化多模态推理中关键的、基于视觉的步骤所必需的学习信号。为弥合这一差距，我们提出了 Token Visual Dependency（词元视觉依赖），通过视觉条件预测分布与仅文本预测分布之间的Kullback-Leibler（KL）散度，量化视觉输入的因果信息增益。在揭示这种依赖关系极为稀疏且语义上至关重要之后，我们提出了基于感知的策略优化（Perception-Grounded Policy Optimization，PGPO），这是一种新型细粒度信用分配框架，能够在词元层面动态重塑优势。通过阈值门控、质量守恒的机制，PGPO主动放大视觉依赖词元的学习信号，同时抑制来自语言先验的梯度噪声。基于Qwen2.5-VL系列的广泛实验，涵盖七个具有挑战性的多模态推理基准测试，表明PGPO平均提升模型18.7%。理论和实证分析均证实PGPO有效减少梯度方差，防止训练崩溃，并作为强有力的正则化器，支持稳健、基于感知的多模态推理。代码将发布在这个 https URL 上。
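Illustrative sketch (not from the paper): one way to read "threshold-gated, mass-conserving" is to score each token by the KL divergence between its visual-conditioned and text-only distributions, zero out tokens below a threshold, and rescale so the per-token advantages still sum to the original total. The gating rule, the threshold value, and the proportional rescaling are assumptions and may not match PGPO's actual mechanism.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reshape_advantage(advantage: float,
                      visual_dists: List[Sequence[float]],
                      text_dists: List[Sequence[float]],
                      threshold: float = 0.1) -> List[float]:
    """Spread one sequence-level advantage over tokens by visual dependency.

    The returned values sum to advantage * num_tokens (mass conservation).
    If no token passes the gate, the uniform assignment is kept.
    """
    deps = [kl_divergence(p, q) for p, q in zip(visual_dists, text_dists)]
    gated = [d if d >= threshold else 0.0 for d in deps]
    total = sum(gated)
    n = len(deps)
    if total == 0.0:
        return [advantage] * n
    return [advantage * n * g / total for g in gated]
```

In this toy version, a token whose prediction does not change when the image is removed receives no credit, and the credit it would have had is moved to visually dependent tokens.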

From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion

从猜测到定位：一个用于不确定性感知代码完成的成本理论框架

Authors: Liang Zhu, Haolin Chen, Lidong Zhao, Xian Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.01849
Pdf link: https://arxiv.org/pdf/2604.01849
Abstract While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill them in directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and designing a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
中文摘要 虽然大型语言模型（LLM）在代码补全方面表现出卓越的熟练度，但它们通常遵循硬补全（HC）范式，即使上下文不足，也必须生成完全具体的代码。我们对300万次真实交互的分析揭示了该策略的局限性：61%的生成建议在接受后被编辑或拒绝，尽管其与用户后续代码的相似度超过80%，表明模型在特定代币位置经常做出错误预测。基于这一观察，我们提出了自适应占位符补全（APC）这一协作框架，通过在高熵位置战略性地输出显式占位符，使用户能够通过IDE导航直接填充HC。理论上，我们将代码补全表述为在不确定性下的成本最小化问题。基于填充占位符的成本低于纠正错误的观察，我们证明存在一个临界熵阈值，超过该阈值APC的期望成本严格低于HC。我们通过从过滤后的真实世界编辑日志构建训练数据，并设计基于成本的奖励函数来实现该框架。在1.5B至14B参数模型中的广泛评估表明，APC在保持标准HC性能的同时，将预期编辑成本从19%降至50%。我们的工作既为不确定性感知代码补全提供了理论基础，也提供了实践培训框架，证明自适应隐匿可以在不牺牲传统补全质量的情况下从头到尾学习。
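Illustrative sketch (not from the paper): the cost argument can be made concrete with a two-outcome model at a single position. The paper states its threshold in terms of entropy; here the error probability is used as a simpler proxy, estimated as one minus the top token probability. The cost values and this proxy are assumptions.

```python
from typing import Sequence


def error_probability(token_probs: Sequence[float]) -> float:
    """Proxy for uncertainty: chance that the most likely token is wrong."""
    return 1.0 - max(token_probs)


def expected_cost_hard(p_error: float, cost_fix: float) -> float:
    """Hard completion: pay the correction cost only when the guess is wrong."""
    return p_error * cost_fix


def critical_error_probability(cost_fix: float, cost_fill: float) -> float:
    """Error probability above which a placeholder is strictly cheaper."""
    return cost_fill / cost_fix


def should_emit_placeholder(token_probs: Sequence[float],
                            cost_fix: float, cost_fill: float) -> bool:
    """Emit a placeholder when its fixed fill cost beats the expected fix cost."""
    return cost_fill < expected_cost_hard(error_probability(token_probs), cost_fix)
```

With a fix cost of 4 and a fill cost of 1, the switch point is an error probability of 0.25: a confident position keeps its concrete guess, and a position with a flat distribution becomes a placeholder.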

The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning

非平稳性中丢失的秩与梯度：以样本权重衰减缓解强化学习中的可塑性损失

Authors: Zihao Wu, Hongyao Tang, Yi Ma, Jiashun Liu, Yan Zheng, Jianye Hao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.01913
Pdf link: https://arxiv.org/pdf/2604.01913
Abstract Deep reinforcement learning (RL) suffers severely from plasticity loss due to non-stationarity, which impairs the ability to adapt to new data and learn continually. Unfortunately, our understanding of how plasticity loss arises, dissipates, and can be dissolved remains limited to empirical findings, leaving the theoretical end underexplored. To address this gap, we study the plasticity loss problem from the theoretical perspective of network optimization. By formally characterizing the two culprit factors in the online RL process, namely the non-stationarity of data distributions and the non-stationarity of targets induced by bootstrapping, our theory attributes the loss of plasticity to two mechanisms: the rank collapse of the Neural Tangent Kernel (NTK) Gram matrix and the $\Theta(\frac{1}{k})$ decay of gradient magnitude. The first mechanism echoes prior empirical findings from the theoretical perspective and sheds light on the effects of existing methods, e.g., network reset, neuron recycle, and noise injection. Against this backdrop, we focus primarily on the second mechanism and aim to alleviate plasticity loss by addressing the gradient attenuation issue, which is orthogonal to existing methods. We propose Sample Weight Decay, a lightweight method to restore gradient magnitude, as a general remedy to plasticity loss for deep RL methods based on experience replay. In experiments, we evaluate the efficacy of Sample Weight Decay on TD3, Double DQN, and SAC with the SimBa architecture in MuJoCo, ALE, and DeepMind Control Suite tasks. The results demonstrate that Sample Weight Decay effectively alleviates plasticity loss and consistently improves learning performance across various configurations of deep RL algorithms, UTD, network architectures, and environments, achieving SOTA performance on challenging DMC Humanoid tasks.
中文摘要 深度强化学习（RL）由于非平稳性而严重遭受可塑性损失，这削弱了适应新数据和持续学习的能力。遗憾的是，我们对可塑性损失如何产生、消散以及如何消除的理解仍限于实证发现，理论方面尚未得到充分探索。为填补这一空白，我们从网络优化的理论视角研究可塑性损失问题。通过形式化描述在线强化学习过程中的两个罪魁祸首，即数据分布的非平稳性和由自举（bootstrapping）引起的目标非平稳性，我们的理论将可塑性损失归因于两个机制：神经正切核（NTK）Gram矩阵的秩坍缩和梯度大小的$\Theta(\frac{1}{k})$衰减。第一个机制从理论角度呼应了以往的经验研究成果，并揭示了现有方法的作用，例如网络重置、神经元回收和噪声注入。在此背景下，我们主要关注第二个机制，旨在通过解决梯度衰减问题来缓解可塑性损失，这与现有方法正交。我们提出了样本权重衰减（Sample Weight Decay），这是一种用于恢复梯度大小的轻量级方法，可作为基于经验回放的深度强化学习方法可塑性损失的通用解决方案。在实验中，我们在 MuJoCo、ALE 和 DeepMind Control Suite 任务中，评估了样本权重衰减在 TD3、Double DQN 以及采用 SimBa 架构的 SAC 上的有效性。结果表明，样本权重衰减有效缓解可塑性损失，并在深度强化学习算法、UTD、网络架构和环境等多种配置下持续提升学习性能，在具有挑战性的 DMC Humanoid 任务中实现 SOTA 性能。

Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm

早期儿童教育中的每日活动图片字幕：基准与算法

Authors: Sixing Li, Zhibin Gu, Ziqi Zhang, Weiguo Pan, Bing Li, Ying Wang, Hongzhe Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01941
Pdf link: https://arxiv.org/pdf/2604.01941
Abstract Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
中文摘要 早期儿童教育（ECE）的图片说明对于自动化活动理解和教育评估至关重要。然而，现有方法面临两个关键挑战。首先，缺乏大规模、领域特定的数据集限制了模型捕捉ECE场景中独特细粒度语义概念的能力，导致描述泛泛且不精确。其次，传统训练范式在提升专业对象描述能力方面存在局限性，监督学习倾向于高频表达，而强化学习在困难样本上可能存在不稳定的优化问题。为解决这些局限性，我们引入了ECAC，这是一个大型ECE日常活动图像说明基准，包含256,121张真实世界图像，配有专家级说明和细粒度标签。ECAC还配备了面向领域的评估协议——教学玩具识别评分（TTS），以明确衡量专业物品命名的准确性。此外，我们提出了RSRS（奖励条件交换强化学习与监督微调），这是一种动态交替在强化学习和监督优化之间切换的混合训练框架。通过将无奖励的硬样本重定向到监督微调，RSRS有效减轻优势崩溃，并实现细粒度识别的稳定优化。利用ECAC和RSRS，我们开发了KinderMM-Cap-3B，一个领域适配的多模态大型语言模型。大量实验表明，我们的模型实现了51.06的TTS，远超最先进的基线，同时保持优越的字幕质量，凸显其在专业教育应用中的潜力。
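Illustrative sketch (not from the paper): the reward-conditional switch can be pictured as a routing step over sampled groups. A group whose rollouts all score zero yields all-zero group-relative advantages (the "advantage collapse" in the abstract), so it is sent to supervised fine-tuning instead. The data layout and the mean-centred advantage are assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantage: reward minus the group mean."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def route_by_reward(groups: Dict[str, List[float]]) -> Tuple[Dict[str, List[float]], List[str]]:
    """Split samples into an RL batch (with advantages) and an SFT batch.

    `groups` maps a (hypothetical) sample ID to the rewards of its rollouts.
    Samples where every rollout earned zero reward go to supervised fine-tuning.
    """
    rl_batch: Dict[str, List[float]] = {}
    sft_batch: List[str] = []
    for sample_id, rewards in groups.items():
        if all(r == 0 for r in rewards):
            sft_batch.append(sample_id)
        else:
            rl_batch[sample_id] = group_advantages(rewards)
    return rl_batch, sft_batch
```

The routed SFT samples would then be trained against their reference captions, while the RL batch proceeds with policy optimization as usual.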

ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning

ProCeedRL：提供探索性演示强化学习的过程批评者，适用于LLM代理推理

Authors: Jingyue Gao, Yanjiang Guo, Xiaoshuai Chen, Jianyu Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.02006
Pdf link: https://arxiv.org/pdf/2604.02006
Abstract Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations into misleading contexts, which further weaken subsequent decision-making, making recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and susceptible to the model's reasoning and the environment's randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Explorative Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model's saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.
中文摘要 强化学习（RL）显著增强了大型语言模型（LLM）的推理能力，但由于交互具有长视野和环境反馈的随机性，将其应用于多回合代理任务仍具挑战性。我们在能动探索中识别出一种结构性失效模式：次优行为会引发噪声观察到误导性情境，进一步削弱后续决策，使恢复变得越来越困难。这种累积的误差反馈循环使标准探索策略无效，容易受模型推理和环境随机性影响。为缓解这一问题，我们提出了ProCeedRL：过程批判者与探索性示范强化学习，将探索从被动选择转向主动干预。ProCeedRL采用过程级批评器实时监控交互，结合基于反射的演示，指导代理阻止错误的累积。我们发现该方法远超模型的饱和探索性能，展现出显著的探索性优势。通过从探索性演示和政策样本中学习，ProCeedRL显著提升了探索效率，并在复杂的深度搜索和具象任务中实现了卓越的性能。

Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning

Apriel-Reasoner：通用且高效推理的强化后培训

Authors: Rafael Pardinas, Ehsan Kamalloo, David Vazquez, Alexandre Drouin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.02007
Pdf link: https://arxiv.org/pdf/2604.02007
Abstract Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
中文摘要 利用可验证奖励的强化学习（RLVR）构建跨多个领域的通用推理模型已被前沿开放权重模型广泛采用。然而，他们的训练配方和领域混合物往往未被披露。跨域联合优化存在重大挑战：各领域在推广长度、问题难度和样本效率方面差异巨大。此外，长思考链的模型增加了推理成本和延迟，使得效率对实际部署至关重要。我们展示了Apriel-Reasoner，该方案在Apriel-Base（一个15B参数的开权大语言模型）上，使用五个领域：数学、代码生成、指令跟随、逻辑谜题和函数调用等五个领域进行训练。我们引入了一种自适应域采样机制，尽管展开动态异质，目标域比例仍能保持，并对标准长度惩罚进行了难度感知的扩展，无需额外训练开销，鼓励对困难问题进行更长时间的推理，对简单问题进行更短的追踪。Apriel-Reasoner 在严格的 16K 令牌输出预算下训练，推理时推广至 32K 令牌，并且在 AIME 2025、GPQA、MMLU-Pro 和 LiveCodeBench 上优于 Apriel-Base，同时产生了 30-50% 的推理短距离。它以更低的代币成本匹配了规模相似的强开放权模型，推动了准确度与代币预算的帕累托边界。
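Illustrative sketch (not from the paper): a difficulty-aware length penalty can be imagined as a standard length penalty scaled by how easy the prompt is, so that easy prompts are pushed toward short traces and hard prompts are barely penalized. Using the empirical pass rate as the difficulty signal, the linear form, and the coefficient are all assumptions; the abstract does not give the formula.

```python
def shaped_reward(correct: bool, num_tokens: int, max_tokens: int,
                  pass_rate: float, coeff: float = 0.2) -> float:
    """Correctness reward minus a length penalty scaled by prompt easiness.

    `pass_rate` is the (hypothetical) fraction of rollouts that solved this
    prompt: near 1.0 for easy prompts, near 0.0 for hard ones.
    """
    length_fraction = min(num_tokens, max_tokens) / max_tokens
    return float(correct) - coeff * pass_rate * length_fraction
```

Because the pass rate can be read off the rollouts already sampled for the update, a penalty of this shape would add no extra training cost, which is consistent with the abstract's "no additional training overhead" claim.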

Bridging Discrete Planning and Continuous Execution for Redundant Robot

桥接离散规划与冗余机器人的持续执行

Authors: Teng Yan, Yue Yu, Yihan Liu, Bingzhuo Zhong
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.02021
Pdf link: https://arxiv.org/pdf/2604.02021
Abstract Voxel-grid reinforcement learning is widely adopted for path planning in redundant manipulators due to its simplicity and reproducibility. However, direct execution through point-wise numerical inverse kinematics on 7-DoF arms often yields step-size jitter, abrupt joint transitions, and instability near singular configurations. This work proposes a bridging framework between discrete planning and continuous execution without modifying the discrete planner itself. On the planning side, step-normalized 26-neighbor Cartesian actions and a geometric tie-breaking mechanism are introduced to suppress unnecessary turns and eliminate step-size oscillations. On the execution side, a task-priority damped least-squares (TP-DLS) inverse kinematics layer is implemented. This layer treats end-effector position as a primary task, while posture and joint centering are handled as subordinate tasks projected into the null space, combined with trust-region clipping and joint velocity constraints. On a 7-DoF manipulator in random sparse, medium, and dense environments, this bridge raises planning success in dense scenes from about 0.58 to 1.00, shortens representative path length from roughly 1.53 m to 1.10 m, and while keeping end-effector error below 1 mm, reduces peak joint accelerations by over an order of magnitude, substantially improving the continuous execution quality of voxel-based RL paths on redundant manipulators.
中文摘要 体素网格强化学习因其简单性和可重复性，被广泛应用于冗余操作器的路径规划。然而，通过点数值逆运动学直接执行7-DoF臂，常常会出现步长抖动、突兀的关节转变以及在奇异构型附近的不稳定性。本研究提出了一个连接离散规划与连续执行之间的桥梁框架，而无需修改离散规划器本身。在规划方面，引入了阶梯归一化的26邻笛卡尔作用和几何平局机制，以抑制不必要的转弯并消除步长振荡。在执行端，实现了任务优先级阻尼最小二乘法（TP-DLS）逆运动学层。该层将端执行器位置视为主任务，而姿态和关节居中作为次要任务投射到空空间，结合信任区域裁剪和关节速度约束。在随机稀疏、中密度和密集环境中的7景深操作手上，该桥将密集场景中的规划成功率从约0.58提升至1.00，将代表性路径长度从约1.53米缩短至1.10米，同时将末端执行器误差控制在1毫米以下，将峰值关节加速度降低了一个数量级以上，显著提升了冗余机械臂上基于体素的强化学习路径的连续执行质量。
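Illustrative sketch (not from the paper): the structure of a task-priority damped least-squares step can be shown on a planar 3-link arm, which is redundant for a 2D position task. The primary task tracks the end-effector position, a joint-centering term is projected through the (damped, hence approximate) null-space projector, and the step norm is clipped as a simple trust region. The arm, link lengths, gains, and clipping rule are assumptions; the paper works with a 7-DoF manipulator.

```python
import math
from typing import List, Sequence, Tuple

LINKS = (1.0, 1.0, 1.0)  # hypothetical link lengths


def forward(q: Sequence[float]) -> Tuple[float, float]:
    """End-effector position of the planar arm."""
    x = y = phi = 0.0
    for length, angle in zip(LINKS, q):
        phi += angle
        x += length * math.cos(phi)
        y += length * math.sin(phi)
    return x, y


def jacobian(q: Sequence[float]) -> List[List[float]]:
    """2 x n position Jacobian."""
    phis, phi = [], 0.0
    for angle in q:
        phi += angle
        phis.append(phi)
    n = len(q)
    jx = [-sum(LINKS[i] * math.sin(phis[i]) for i in range(j, n)) for j in range(n)]
    jy = [sum(LINKS[i] * math.cos(phis[i]) for i in range(j, n)) for j in range(n)]
    return [jx, jy]


def damped_pinv(J: List[List[float]], damping: float) -> List[List[float]]:
    """n x 2 matrix J^T (J J^T + damping^2 I)^-1, with a closed-form 2x2 inverse."""
    a = sum(v * v for v in J[0]) + damping ** 2
    b = sum(u * v for u, v in zip(J[0], J[1]))
    d = sum(v * v for v in J[1]) + damping ** 2
    det = a * d - b * b
    return [[(J[0][j] * d - J[1][j] * b) / det,
             (J[1][j] * a - J[0][j] * b) / det] for j in range(len(J[0]))]


def tp_dls_step(q, target, q_center, damping=0.05, gain=0.1, max_step=0.2):
    """One step: primary position task plus null-space joint centering."""
    x, y = forward(q)
    err = (target[0] - x, target[1] - y)
    J = jacobian(q)
    P = damped_pinv(J, damping)
    n = len(q)
    primary = [P[j][0] * err[0] + P[j][1] * err[1] for j in range(n)]
    secondary = [gain * (c - a) for a, c in zip(q, q_center)]
    Js = [sum(J[r][j] * secondary[j] for j in range(n)) for r in range(2)]
    projected = [secondary[j] - (P[j][0] * Js[0] + P[j][1] * Js[1]) for j in range(n)]
    dq = [p + s for p, s in zip(primary, projected)]
    norm = math.sqrt(sum(v * v for v in dq))
    if norm > max_step:  # trust-region style clipping of the joint step
        dq = [v * max_step / norm for v in dq]
    return [a + v for a, v in zip(q, dq)]
```

Calling `tp_dls_step` repeatedly moves the end effector toward `target` while the leftover degree of freedom drifts toward `q_center`; the damping term keeps the step bounded near singular configurations at the price of a small tracking bias.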

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

通过强化学习优化 RAG 重排序器，并用 LLM 反馈

Authors: Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.02091
Pdf link: https://arxiv.org/pdf/2604.02091
Abstract Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
中文摘要 重新排序器在优化检索增强生成的检索结果中起着关键作用。然而，当前的重新排序模型通常在静态人工注释相关标签上单独优化，与下游生成过程解耦。这种隔离导致了根本性的错位：通过信息检索指标被识别为主题相关的文档，往往无法提供LLM所需的精确答案生成所需的实际效用。为弥合这一差距，我们引入了重新排序偏好优化（RRPO），这是一个强化学习框架，直接将重新排序与LLM的生成质量对齐。通过将重新排序表述为顺序决策过程，RRPO利用LLM反馈优化上下文效用，从而消除了昂贵的人工注释需求。为确保训练稳定性，我们进一步引入了参考锚定的确定性基线。大量知识密集基准测试显示，RRPO的表现显著优于强基准，包括强大的按列表重新排序工具RankZephyr。进一步分析显示了我们框架的多功能性：它能够无缝推广到不同的读者（例如 GPT-4o），与 Query2Doc 等查询扩展模块正交集成，即使在噪声较大的导师指导下也能保持稳健。
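Illustrative sketch (not from the paper): one plausible reading of a "reference-anchored deterministic baseline" is to score a sampled ranking against the utility of a fixed reference ranking for the same query. The `answer_coverage` utility below is a toy stand-in for LLM feedback, and the top-k cut-off is an assumption.

```python
from typing import Callable, List


def answer_coverage(docs: List[str], answer: str) -> float:
    """Toy utility: 1.0 if any selected document contains the answer string."""
    return 1.0 if any(answer in d for d in docs) else 0.0


def anchored_advantage(sampled: List[str], reference: List[str], answer: str,
                       top_k: int = 2,
                       utility: Callable[[List[str], str], float] = answer_coverage) -> float:
    """Utility of the sampled top-k minus utility of the reference top-k."""
    return utility(sampled[:top_k], answer) - utility(reference[:top_k], answer)
```

Because the reference ranking is fixed for a given query, the baseline does not fluctuate with sampling noise, which is the kind of stability the abstract attributes to this design.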

Auction-Based Online Policy Adaptation for Evolving Objectives

基于拍卖的在线政策适应以适应不断变化的目标

Authors: Guruprerana Shabadi, Kaushik Mallik
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.02151
Pdf link: https://arxiv.org/pdf/2604.02151
Abstract We consider multi-objective reinforcement learning problems where objectives come from an identical family -- such as the class of reachability objectives -- and may appear or disappear at runtime. Our goal is to design adaptive policies that can efficiently adjust their behaviors as the set of active objectives changes. To solve this problem, we propose a modular framework where each objective is supported by a selfish local policy, and coordination is achieved through a novel auction-based mechanism: policies bid for the right to execute their actions, with bids reflecting the urgency of the current state. The highest bidder selects the action, enabling a dynamic and interpretable trade-off among objectives. Going back to the original adaptation problem, when objectives change, the system adapts by simply adding or removing the corresponding policies. Moreover, as objectives arise from the same family, identical copies of a parameterized policy can be deployed, facilitating immediate adaptation at runtime. We show how the selfish local policies can be computed by turning the problem into a general-sum game, where the policies compete against each other to fulfill their own objectives. To succeed, each policy must not only optimize its own objective, but also reason about the presence of other goals and learn to produce calibrated bids that reflect relative priority. In our implementation, the policies are trained concurrently using proximal policy optimization (PPO). We evaluate on Atari Assault and a gridworld-based path-planning task with dynamic targets. Our method achieves substantially better performance than monolithic policies trained with PPO.
中文摘要 我们考虑多目标强化学习问题，目标来自同一族——如可达目标类——且在运行时可能出现或消失。我们的目标是设计适应性政策，随着主动目标组的变化，能够高效调整其行为。为解决这一问题，我们提出了一个模块化框架，每个目标由自私的地方政策支持，协调通过一种新型的拍卖机制实现：政策竞标执行其行动的权利，竞标反映当前状态的紧迫性。出价最高者选择行动，从而实现目标间动态且可理解的权衡。回到最初的适应问题，当目标发生变化时，系统通过添加或移除相应的策略来适应。此外，由于目标源自同一家族，可以部署相同的参数化策略副本，便于运行时的即时适应。我们展示了如何通过将问题转化为一般和博弈来计算自私的局部政策，在其中政策相互竞争以实现各自的目标。要取得成功，每个政策不仅要优化自身目标，还要考虑其他目标的存在，并学会制定反映相对优先级的校准竞标。在我们的实现中，策略是同时通过近端策略优化（PPO）进行训练。我们评估了Atari Assault和基于网格世界的路径规划任务，目标动态。我们的方法比用PPO训练的单一策略表现要好得多。
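
The auction step described in the abstract above can be sketched as follows. This is a minimal illustration only, not the authors' implementation. In the paper, the local policies and their bids are learned with PPO. Here `make_reach_policy` is a hypothetical hand-written policy, and the bid rule (a higher bid when the target is closer) is an assumption chosen purely for demonstration.

```python
from typing import Callable, Dict, Tuple

State = Tuple[int, int]
# A local policy maps a state to (bid, action).
LocalPolicy = Callable[[State], Tuple[float, str]]


def make_reach_policy(target: State) -> LocalPolicy:
    """Hypothetical stand-in for a learned policy: step toward `target`."""
    def policy(state: State) -> Tuple[float, str]:
        dx, dy = target[0] - state[0], target[1] - state[1]
        dist = abs(dx) + abs(dy)
        if dist == 0:
            action = "stay"
        elif abs(dx) >= abs(dy):
            action = "right" if dx > 0 else "left"
        else:
            action = "up" if dy > 0 else "down"
        # Assumed bid rule: urgency grows as the target gets closer.
        bid = 1.0 / (1.0 + dist)
        return bid, action
    return policy


class AuctionController:
    """Holds one local policy per active objective; the highest bidder acts."""

    def __init__(self) -> None:
        self.policies: Dict[str, LocalPolicy] = {}

    def add_objective(self, name: str, policy: LocalPolicy) -> None:
        self.policies[name] = policy

    def remove_objective(self, name: str) -> None:
        self.policies.pop(name, None)

    def act(self, state: State) -> Tuple[str, str]:
        if not self.policies:
            raise ValueError("no active objectives")
        offers = {name: policy(state) for name, policy in self.policies.items()}
        winner = max(offers, key=lambda name: offers[name][0])
        return winner, offers[winner][1]


controller = AuctionController()
controller.add_objective("far", make_reach_policy((5, 0)))
controller.add_objective("near", make_reach_policy((0, 1)))
print(controller.act((0, 0)))
```

Adding or removing an objective at runtime only changes the dictionary of bidders, which mirrors the adaptation-by-composition idea in the abstract.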

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

多代理视频推荐器：演变、模式与开放挑战

Authors: Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.02211
Pdf link: https://arxiv.org/pdf/2604.02211
Abstract Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.
中文摘要 视频推荐系统是人工智能最受欢迎且影响深远的应用之一，塑造着数十亿用户的内容消费和文化。传统的单模型推荐器，优化静态参与度指标，但其在满足现代平台动态需求方面越来越有限。为此，多智能体架构正在重新定义视频推荐系统服务、学习和适应用户和数据集的方式。这些基于代理的系统协调负责视频理解、推理、记忆和反馈的专业代理，提供精准且可解释的建议。在本次调查中，我们追溯了多代理视频推荐系统（MAVRS）的发展历程。我们结合了多智能体推荐系统、基础模型和对话式人工智能的理念，最终发展出新兴的大型语言模型（LLM）驱动的MAVRS领域。我们提出了协作模式的分类法，并分析了从短视频到教育平台等多种视频领域的协调机制。我们讨论了代表性框架，包括早期的多智能体强化学习（MARL）系统，如MMRF，以及近期的大型语言模型驱动架构如MACRec和Agent4Rec，以说明这些模式。我们还概述了可扩展性、多模态理解、激励对齐等未解决的挑战，并确定了混合强化学习-大型语言模型系统、终身个性化和自我改进推荐系统等研究方向。

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

何时提出：不确定门控语言辅助强化学习

Authors: Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.02226
Pdf link: https://arxiv.org/pdf/2604.02226
Abstract Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
中文摘要 强化学习（RL）代理常常在分布外（OOD）场景下遇到困难，导致高度不确定性和随机行为。虽然语言模型（LM）包含宝贵的世界知识，但较大的模型计算成本高，阻碍实时使用，并在自主规划方面存在局限性。我们引入了“知识自适应安全”（ASK），将较小的LM与经过培训的强化学习策略结合起来，以增强OOD泛化能力，无需再培训。ASK采用蒙特卡洛Dropout算法评估不确定性，只有当不确定性超过设定阈值时，才向LM查询行动建议。这种选择性使用既保持了现有策略的效率，又在不确定的情况下利用了语言模型的推理能力。在FrozenLake环境的实验中，ASK在域内无明显提升，但在传输任务中展现出稳健的导航能力，获得了0.95的奖励。我们的发现表明，有效的神经符号整合需要精心编排而非简单组合，凸显了成功实现OOD泛化所需的足够模型规模和有效的混合机制。
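
The uncertainty gate in the abstract above can be sketched as follows. This is a rough illustration under stated assumptions, not the ASK implementation. The abstract says only that Monte Carlo Dropout is used to assess uncertainty. Using the entropy of the averaged action distribution as the uncertainty score, the tiny linear policy, and the `query_lm` callback are all assumptions made for this example.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout_forward(weights, bias, features, drop_prob, rng):
    """One stochastic pass of a toy linear policy with inverted dropout on inputs."""
    keep = 1.0 - drop_prob
    masked = [f / keep if rng.random() < keep else 0.0 for f in features]
    logits = [
        sum(w * x for w, x in zip(row, masked)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def mc_dropout_entropy(weights, bias, features, n_samples=50, drop_prob=0.5, seed=0):
    """Average action probabilities over stochastic passes; return (mean, entropy)."""
    rng = random.Random(seed)
    mean = [0.0] * len(bias)
    for _ in range(n_samples):
        probs = dropout_forward(weights, bias, features, drop_prob, rng)
        mean = [m + p / n_samples for m, p in zip(mean, probs)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy


def ask_act(weights, bias, features, query_lm, threshold):
    """Use the RL policy when it is confident; otherwise ask the language model."""
    mean, entropy = mc_dropout_entropy(weights, bias, features)
    if entropy > threshold:
        return query_lm(features), "lm"  # query_lm is a hypothetical callback
    best = max(range(len(mean)), key=lambda a: mean[a])
    return best, "policy"
```

The threshold trades LM calls against the risk of acting on an unreliable policy output. The abstract does not give its value, so it is left as a parameter here.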

Model-Based Reinforcement Learning for Control under Time-Varying Dynamics

基于模型的强化学习用于时间变化动力学控制

Authors: Klemens Iten, Bruce Lee, Chenhao Li, Lenart Treven, Andreas Krause, Bhavya Sukhija
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.02260
Pdf link: https://arxiv.org/pdf/2604.02260
Abstract Learning-based control methods typically assume stationary system dynamics, an assumption often violated in real-world systems due to drift, wear, or changing operating conditions. We study reinforcement learning for control under time-varying dynamics. We consider a continual model-based reinforcement learning setting in which an agent repeatedly learns and controls a dynamical system whose transition dynamics evolve across episodes. We analyze the problem using Gaussian process dynamics models under frequentist variation-budget assumptions. Our analysis shows that persistent non-stationarity requires explicitly limiting the influence of outdated data to maintain calibrated uncertainty and meaningful dynamic regret guarantees. Motivated by these insights, we propose a practical optimistic model-based reinforcement learning algorithm with adaptive data buffer mechanisms and demonstrate improved performance on continuous control benchmarks with non-stationary dynamics.
中文摘要 基于学习的控制方法通常假设系统动力学是固定的，而这一假设在现实系统中常因漂移、磨损或工作条件变化而被打破。我们研究在时间变化动力学下控制的强化学习。我们考虑一种持续基于模型的强化学习环境，其中代理反复学习并控制一个动态系统，其过渡动态会随着剧集演变。我们利用高斯过程动力学模型在频率主义变分预算假设下分析该问题。我们的分析表明，持续的非平稳性需要明确限制过时数据的影响，以保持校准的不确定性和有意义的动态遗憾保证。基于这些见解，我们提出了一种实用的乐观模型基础强化学习算法，具有自适应数据缓冲机制，并在非平稳动力学连续控制基准测试中性能提升。
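
One simple way to limit the influence of outdated data, as the abstract above calls for, is a sliding-window episode buffer. The sketch below is an assumption-laden illustration, not the paper's mechanism. The abstract does not specify how its adaptive data buffers work. The class name `WindowedBuffer`, the fixed window, and the optional reset on a large model error are all hypothetical choices.

```python
from collections import deque


class WindowedBuffer:
    """Keeps only the most recent episodes for fitting the dynamics model."""

    def __init__(self, window_episodes, reset_error=None):
        # deque with maxlen silently drops the oldest episode when full.
        self.episodes = deque(maxlen=window_episodes)
        self.reset_error = reset_error

    def add_episode(self, transitions, model_error=0.0):
        # Assumed heuristic: a large prediction error suggests the dynamics
        # changed, so older data is discarded before storing the new episode.
        if self.reset_error is not None and model_error > self.reset_error:
            self.episodes.clear()
        self.episodes.append(list(transitions))

    def training_data(self):
        return [t for episode in self.episodes for t in episode]


buffer = WindowedBuffer(window_episodes=3)
for episode_index in range(5):
    buffer.add_episode([episode_index])
print(buffer.training_data())
```

A shorter window tracks changes faster but leaves less data for the model, which is the trade-off that a variation budget is meant to capture.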

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

SKILL0：情境内能动强化学习，用于技能内化

Authors: Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.02268
Pdf link: https://arxiv.org/pdf/2604.02268
Abstract Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at this https URL.
中文摘要 代理技能，即代理在推理时动态加载的结构化程序知识和可执行资源包，已成为增强LLM代理的可靠机制。然而，推理时间技能增强本质上是有限的：检索噪声引入无关的指导，注入的技能内容带来大量代币开销，模型从未真正获得它仅仅遵循的知识。我们探讨技能是否可以内化到模型参数中，实现零射自动行为，无需运行时技能检索。我们介绍SKILL0，一个旨在内化技能的情境强化学习框架。SKILL0引入了一套培训时间课程，从完整的技能背景开始，逐步撤回。技能按类别离线分组，并结合交互历史呈现为紧凑的视觉语境，教授模型工具调用和多回合任务完成。动态课程随后评估每个技能档案的保单适用性，只保留那些在线性递减预算内仍被当前保单受益的技能，直到代理人在完全零机会的环境中运作。大量代理实验表明，SKILL0 相较标准强化学习基线实现了显著改进（ALFWorld 为 +9.7%，Search-QA 为 +6.6\%），同时保持每步少于 0.5k 个令牌的高效上下文。我们的代码可在此 https URL 访问。
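
The Dynamic Curriculum step in the abstract above (keep only skills that still help, within a linearly decaying budget) can be sketched as follows. This is a hedged illustration, not the SKILL0 code. The function names, the rounding rule, and the definition of helpfulness as a per-skill gain score are assumptions, since the abstract does not define them.

```python
import math


def skill_budget(step, total_steps, initial_budget):
    """Number of skill files allowed in context at `step`, decaying linearly to 0."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return math.floor(initial_budget * remaining)


def select_skills(helpfulness, budget):
    """Keep the most helpful skills with positive gain, up to `budget` of them.

    `helpfulness` maps a skill name to an assumed on-policy gain, for example
    the return with the skill in context minus the return without it.
    """
    useful = [(gain, name) for name, gain in helpfulness.items() if gain > 0]
    useful.sort(reverse=True)
    return [name for _, name in useful[:budget]]


gains = {"search": 0.3, "navigate": -0.1, "pickup": 0.1, "heat": 0.2}
for step in (0, 50, 100):
    budget = skill_budget(step, total_steps=100, initial_budget=4)
    print(step, select_skills(gains, budget))
```

At the final step the budget reaches zero, so no skill is placed in context and the agent runs in the zero-shot setting that the abstract describes.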

CIVIC: Cooperative Immersion Via Intelligent Credit-sharing in DRL-Powered Metaverse

CIVIC：通过智能信用共享实现的合作沉浸式，在日日学习驱动的元宇宙中实现

Authors: Amr Aboeleneen, Mohamed Abdallah, Aiman Erbad, Amr Salem
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.02284
Pdf link: https://arxiv.org/pdf/2604.02284
Abstract The Metaverse faces complex resource allocation challenges due to diverse Virtual Environments (VEs), Digital Twins (DTs), dynamic user demands, and strict immersion needs. This paper introduces CIVIC (Cooperative Immersion Via Intelligent Credit-sharing), a novel framework optimizing resource sharing among multiple Metaverse Service Providers (MSPs) to enhance user immersion. Unlike existing methods, CIVIC integrates VE rendering, DT synchronization, credit sharing, and immersion-aware provisioning within a cooperative multi-MSP model. The resource allocation problem is formulated as two NP-hard challenges: a non-cooperative setting where MSPs operate independently and a cooperative setting utilizing a General Credit Pool (GCP) for dynamic resource sharing. Using Deep Reinforcement Learning (DRL) for tuning resources and managing cooperating MSPs, CIVIC achieves 12-36% higher request completion, 23-70% higher fulfillment rates, 20-60% more served clients, and up to 51% more fairly distributed requests, all with competitive costs. Extensive experiments demonstrate CIVIC's resilience, adaptability, and robust performance under dynamic load conditions and unexpected demand surges, making it suitable for real-world distributed Metaverse infrastructures.
中文摘要 元宇宙面临复杂的资源分配挑战，源于多样化的虚拟环境（VE）、数字孪生（DT）、动态的用户需求以及严格的沉浸感需求。本文介绍了CIVIC（通过智能信用共享实现合作沉浸），这是一个新颖框架，优化多个元宇宙服务提供商（MSP）之间的资源共享，以增强用户沉浸感。与现有方法不同，CIVIC将VE渲染、DT同步、信用共享和沉浸感配置整合在合作多MSP模型中。资源分配问题被表述为两个NP难挑战：一个是MSP独立运营的非合作环境，另一个是利用通用信贷池（GCP）进行动态资源共享的合作环境。通过深度强化学习（DRL）调整资源和管理合作的MSP，CIVIC实现了12-36%的请求完成率提升，23-70%的履行率提升，服务客户数量增加20-60%，且公平分布请求提升了多达51%，所有这些都具有竞争力的成本。大量实验证明了CIVIC在动态负载和突发需求激增下的韧性、适应性和稳健性能，使其适合现实世界的分布式元宇宙基础设施。

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

通过样本路由统一群相对和自蒸馏策略优化

Authors: Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.02288
Pdf link: https://arxiv.org/pdf/2604.02288
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为大型语言模型后训练的标准范式。虽然集团相对策略优化（GRPO）被广泛采用，但其粗略的信用分配普遍惩罚失败的推广，缺乏有效解决特定偏差所需的代币层级关注。自蒸馏策略优化（SDPO）通过提供更密集、更有针对性的logit级监督来解决这个问题，促进快速早期改进，但在长期培训中常常会崩溃。我们将这种后期阶段的不稳定性归因于两个内在缺陷：对已正确样本进行自我蒸馏会带来优化模糊性，且自学者的信号可靠性逐渐下降。为解决这些问题，我们提出了样本路由策略优化（SRPO），这是一个统一的策略内框架，将正确样本路由到GRPO的奖励对齐强化，失败的样本路由到SDPO的目标logit级修正。SRPO还采用了熵感知的动态加权机制，以抑制高熵且不可靠的蒸馏目标，同时强调自信的目标。SRPO在五个基准测试和两个模型尺度上进行了评估，既实现了SDPO的早期快速改进，也实现了GRPO的长期稳定性。它持续超过两个基线的峰值表现，使Qwen3-8B五个基准的平均值比GRPO提高了3.4%，比SDPO提高了6.3%，同时响应长度适中，每步计算成本降低了最多17.2%。
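
The routing rule in the abstract above (correct rollouts go to the GRPO-style signal, failed rollouts go to SDPO-style distillation with entropy-aware weights) can be sketched as follows. This is an illustrative sketch under assumptions, not the SRPO implementation. The binary reward, the mean/std advantage normalization, and the weight `1 - H / log(V)` are choices made for this example. The abstract does not give the exact formulas.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (GRPO-style, assumed form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_samples(rewards, teacher_token_probs):
    """Assign each rollout to a training branch.

    rewards: one verifiable reward per rollout (assumed 1 = correct, 0 = failed).
    teacher_token_probs: per rollout, a list of per-token teacher distributions.
    """
    advantages = group_relative_advantages(rewards)
    routed = []
    for reward, advantage, token_probs in zip(rewards, advantages, teacher_token_probs):
        if reward > 0:
            routed.append({"branch": "grpo", "advantage": advantage})
        else:
            # Assumed weighting: down-weight high-entropy (unreliable) teacher
            # targets, with weight 1 for a one-hot target and 0 for a uniform one.
            weights = [1.0 - entropy(p) / math.log(len(p)) for p in token_probs]
            routed.append({"branch": "sdpo", "token_weights": weights})
    return routed
```

Each rollout contributes to exactly one loss term. This reflects the abstract's point that self-distillation on already-correct samples introduces optimization ambiguity.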

Beyond Referring Expressions: Scenario Comprehension Visual Grounding

超越指称表达：情境理解视觉基础

Authors: Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.02323
Pdf link: https://arxiv.org/pdf/2604.02323
Abstract Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
中文摘要 现有的视觉基础基准主要评估图像区域与字面指称表达式之间的对齐，模型通常通过匹配显著的命名类别而成功。我们探索一种互补且更具挑战性的情景视觉基础环境，目标必须从角色、意图和关系语境中推断，而非明确命名。我们引入了为该环境设计的基准测试——引荐情景理解（RSC）。该基准测试中的查询为段落长度的文本，描述对象角色、用户目标和上下文线索，包括有意引用通常需要深度理解才能解决的干扰对象。每个实例都标注了可解释的难度标签，涵盖唯一性、杂乱性、大小、重叠和位置，揭示了不同的失效模式并支持细致分析。RSC包含约3.1万个训练示例，4千个域内测试示例，以及3千个分布外的未见对象类别拆分。我们还提出了ScenGround，一种课程推理方法，作为该环境的参考点，结合了监督热启与困难感知强化学习。实验表明，基于情景的查询揭示了当前模型中标准基准无法揭示的系统性失败，且课程培训能提升在具有挑战性的切片和向标准基准的转移时的表现。

Keyword: diffusion policy

There is no result