Arxiv Papers of Today

Generated: 2026-04-02 17:00:51 (UTC+8); Arxiv release time: 2026-04-02 20:00 EDT (2026-04-03 08:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

Authors: Miaosen Luo, Zhenhao Yang, Jieshen Long, Jinghu Sun, Yichu Liu, Sijie Mai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.00013
Pdf link: https://arxiv.org/pdf/2604.00013
Abstract Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.
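
The abstract does not spell out the Hint-GRPO update, so the following is only a minimal sketch of the standard group-relative advantage normalization that GRPO-style methods use. It illustrates the reward sparsity problem the abstract mentions: when every rollout in a group receives the same reward, as can happen on hard samples, all advantages are zero and the sample contributes no learning signal. The function name and the `eps` constant are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: standard GRPO-style normalization, not the paper's exact rule.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hard sample: every rollout fails, so every advantage is zero.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
# One success among failures: the successful rollout gets a positive advantage.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

A hint that makes at least one rollout succeed turns the first case into the second, which is one plausible reading of how directional hints mitigate sparse rewards.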

Generalizable Dense Reward for Long-Horizon Robotic Tasks

Authors: Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00055
Pdf link: https://arxiv.org/pdf/2604.00055
Abstract Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. Although reinforcement learning (RL) can finetune these models, it does not work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress, which initializes the value function during a brief warm-up phase and avoids prohibitive inference cost during full training; self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations can be found at this https URL.

Evolution Strategies for Deep RL pretraining

Authors: Adrian Martínez, Ananya Gupta, Hanka Goralija, Mario Rico, Saúl Fenollosa, Tamar Alphaidze
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00066
Pdf link: https://arxiv.org/pdf/2604.00066
Abstract Although Deep Reinforcement Learning (DRL) has proven highly effective for complex decision-making problems, it demands significant computational resources and careful parameter adjustment in order to develop successful strategies. Evolution strategies (ES) offer a more straightforward, derivative-free approach that is less computationally costly and simpler to deploy. However, ES generally do not match the performance levels achieved by DRL, which calls into question their suitability for more demanding scenarios. This study examines the performance of ES and DRL across tasks of varying difficulty, including Flappy Bird, Breakout and MuJoCo environments, as well as whether ES could be used for initial training to enhance DRL algorithms. The results indicate that ES do not consistently train faster than DRL. When used as a preliminary training step, they only provide benefits in less complex environments (Flappy Bird) and show minimal or no improvement in training efficiency or stability across different parameter settings when applied to more sophisticated tasks (Breakout and MuJoCo Walker).
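
To make "derivative-free" concrete, here is a minimal sketch of one common evolution-strategies update (Gaussian perturbations with standardized fitness scores). The abstract does not say which ES variant the study uses, so the update rule, the hyperparameters, and the toy quadratic `fitness` are all assumptions for illustration.

```python
import numpy as np

def es_step(theta, fitness, rng, pop=50, sigma=0.1, lr=0.02):
    """One ES update: perturb parameters, score each perturbation, recombine."""
    eps = rng.standard_normal((pop, theta.size))
    scores = np.array([fitness(theta + sigma * e) for e in eps])
    # Standardize scores so the step size does not depend on the fitness scale.
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    # Score-weighted average of perturbations; no backpropagation involved.
    grad_estimate = eps.T @ scores / (pop * sigma)
    return theta + lr * grad_estimate

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])           # hypothetical optimum
fitness = lambda w: -np.sum((w - target) ** 2)

theta = np.zeros(3)
for _ in range(300):
    theta = es_step(theta, fitness, rng)
print(theta)
```

Only fitness evaluations are needed, which is why ES is simple to deploy; the cost is `pop` environment evaluations per update.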

Learning to Play Blackjack: A Curriculum Learning Perspective

Authors: Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.00076
Pdf link: https://arxiv.org/pdf/2604.00076
Abstract Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.
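
A curriculum over actions can be sketched as action masking in tabular Q-learning: at each stage the agent may only pick from the actions introduced so far. The stage contents below are hypothetical (in the paper an LLM generates them), and the function names and hyperparameters are assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ["stand", "hit", "double", "split"]
# Hypothetical stages; the paper has an LLM propose the ordering.
CURRICULUM = [["stand", "hit"], ["stand", "hit", "double"], ACTIONS]

def choose_action(q, state, stage, epsilon, rng):
    """Epsilon-greedy choice restricted to the actions unlocked at this stage."""
    allowed = CURRICULUM[stage]
    if rng.random() < epsilon:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q[(state, a)])

def q_update(q, state, action, reward, next_state, stage,
             alpha=0.1, gamma=1.0, done=False):
    """Standard Q-learning update; the bootstrap also respects the mask."""
    best_next = 0.0 if done else max(q[(next_state, a)] for a in CURRICULUM[stage])
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

q = defaultdict(float)
rng = random.Random(0)
q_update(q, "s", "hit", 1.0, "t", stage=0, done=True)
print(choose_action(q, "s", stage=0, epsilon=0.0, rng=rng))
```

Because the Q-table is shared across stages, values learned for early actions carry over when later actions are unlocked.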

Finite-Time Analysis of Projected Two-Time-Scale Stochastic Approximation

Authors: Yitao Bai, Thinh T. Doan, Justin Romberg
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00179
Pdf link: https://arxiv.org/pdf/2604.00179
Abstract We study the finite-time convergence of projected linear two-time-scale stochastic approximation with constant step sizes and Polyak-Ruppert averaging. We establish an explicit mean-square error bound, decomposing it into two interpretable components, an approximation error determined by the constrained subspace and a statistical error decaying at a sublinear rate, with constants expressed through restricted stability margins and a coupling invertibility condition. These constants cleanly separate the effect of subspace choice (approximation errors) from the effect of the averaging horizon (statistical errors). We illustrate our theoretical results through a number of numerical experiments on both synthetic and reinforcement learning problems.
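
For orientation, a generic form of the recursion the abstract describes can be written as follows. This is a sketch under assumed notation, not the paper's exact setup: $\Pi_{\mathcal{X}}, \Pi_{\mathcal{Y}}$ are projections onto the constrained subspaces, $\alpha$ and $\beta$ are the constant step sizes of the two time scales, and $\xi_k, \psi_k$ are noise terms.

```latex
\begin{aligned}
x_{k+1} &= \Pi_{\mathcal{X}}\bigl(x_k - \alpha\,(A_{11}x_k + A_{12}y_k - b_1 + \xi_k)\bigr),\\
y_{k+1} &= \Pi_{\mathcal{Y}}\bigl(y_k - \beta\,(A_{21}x_k + A_{22}y_k - b_2 + \psi_k)\bigr),\\
\bar{x}_K &= \frac{1}{K}\sum_{k=1}^{K} x_k,\qquad
\bar{y}_K = \frac{1}{K}\sum_{k=1}^{K} y_k .
\end{aligned}
```

The last line is Polyak-Ruppert averaging: the reported estimate is the running average of the iterates rather than the final iterate.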

Offline Constrained RLHF with Multiple Preference Oracles

Authors: Brenden Latham, Mehrdad Moharrami
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00200
Pdf link: https://arxiv.org/pdf/2604.00200
Abstract We study offline constrained reinforcement learning from human feedback with multiple preference oracles. Motivated by applications that trade off performance with safety or fairness, we aim to maximize target population utility subject to a minimum protected group welfare constraint. From pairwise comparisons collected under a reference policy, we estimate oracle-specific rewards via maximum likelihood and analyze how statistical uncertainty propagates through the dual program. We cast the constrained objective as a KL-regularized Lagrangian whose primal optimizer is a Gibbs policy, reducing learning to a convex dual problem. We propose a dual-only algorithm that ensures high-probability constraint satisfaction and provide the first finite-sample performance guarantees for offline constrained preference learning. Finally, we extend our theoretical analysis to accommodate multiple constraints and general f-divergence regularization.
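
The step from a KL-regularized Lagrangian to a Gibbs policy and a convex dual can be sketched for a single constraint. The notation is assumed for illustration: $r_0$ is the target-population reward, $r_1$ the protected-group reward, $c$ the welfare threshold, $\beta > 0$ the KL weight, and $\pi_{\mathrm{ref}}$ the reference policy.

```latex
\begin{aligned}
\mathcal{L}(\pi,\lambda)
  &= \mathbb{E}_{x,\,a\sim\pi}\bigl[r_0(x,a) + \lambda\, r_1(x,a)\bigr]
     - \lambda c
     - \beta\,\mathbb{E}_x\,\mathrm{KL}\bigl(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr),
     \qquad \lambda \ge 0,\\[4pt]
\pi_\lambda(a\mid x)
  &= \frac{\pi_{\mathrm{ref}}(a\mid x)\,
     \exp\!\bigl((r_0(x,a) + \lambda r_1(x,a))/\beta\bigr)}{Z_\lambda(x)},
     \qquad
     Z_\lambda(x) = \sum_{a}\pi_{\mathrm{ref}}(a\mid x)\,
     \exp\!\bigl((r_0(x,a) + \lambda r_1(x,a))/\beta\bigr),\\[4pt]
D(\lambda)
  &= \max_{\pi}\mathcal{L}(\pi,\lambda)
   = \beta\,\mathbb{E}_x\bigl[\log Z_\lambda(x)\bigr] - \lambda c .
\end{aligned}
```

Since $\log Z_\lambda(x)$ is a log-sum-exp of functions affine in $\lambda$, $D$ is convex, so learning reduces to a one-dimensional convex minimization over $\lambda \ge 0$ once the rewards are estimated.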

Scalable machine learning-based approaches for energy saving in densely deployed Open RAN

Authors: Xuanyu Liang, Ahmed Al-Tahmeesschi, Swarna Chetty, Cicek Cavdar, Berk Canberk, Hamed Ahmadi
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.00201
Pdf link: https://arxiv.org/pdf/2604.00201
Abstract Densely deployed base stations are responsible for the majority of the energy consumed in the radio access network (RAN). While these deployments are crucial for delivering the required data rate during busy hours of the day, the network can save energy by switching some of them to sleep mode and maintaining coverage and quality of service with the remaining ones. Benefiting from the flexibility that Open RAN provides for embedding machine learning (ML) in network operations, in this work we propose Deep Reinforcement Learning (DRL)-based energy saving solutions. First, we propose three DRL-based methods in the form of xApps, which control the Active/Sleep mode of up to 6 radio units (RUs) from the Near-Real-Time RAN Intelligent Controller (RIC). We also propose a more scalable federated DRL-based solution with an aggregator as an rApp in the Non-Real-Time RIC and local agents as xApps. Our simulation results show the convergence of the proposed methods. We also compare the performance of our federated DRL across three layouts spanning 6-24 RUs and 500-1000 m regions, including a composite multi-region scenario. The results show that our proposed federated TD3 algorithm achieves up to 43.75% faster convergence, more than 50% network energy saving, and 37.4% lower training energy versus centralized baselines, while maintaining the quality of service and improving the robustness of the policy.
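
The aggregator/local-agent split can be illustrated with a weighted parameter average in the style of federated averaging. The abstract does not state the aggregation rule, so this rule, the function name, and the use of sample counts as weights are assumptions.

```python
import numpy as np

def federated_average(local_weights, sample_counts):
    """Average per-agent parameter dicts, weighted by each agent's sample count.

    Assumption: FedAvg-style aggregation; the paper's rApp may use another rule.
    """
    total = float(sum(sample_counts))
    keys = local_weights[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(local_weights, sample_counts))
        for k in keys
    }

# Two hypothetical local agents (xApps) reporting one parameter tensor each.
agent_a = {"actor.w": np.array([0.0, 2.0])}
agent_b = {"actor.w": np.array([4.0, 6.0])}
global_weights = federated_average([agent_a, agent_b], sample_counts=[1, 3])
print(global_weights["actor.w"])
```

Only parameters travel to the aggregator, so adding regions grows the number of local agents without enlarging any single agent's state or action space.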

Autonomous Adaptive Solver Selection for Chemistry Integration via Reinforcement Learning

Authors: Eloghosa Ikponmwoba, Opeoluwa Owoyele
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00264
Pdf link: https://arxiv.org/pdf/2604.00264
Abstract The computational cost of stiff chemical kinetics remains a dominant bottleneck in reacting-flow simulation, yet hybrid integration strategies are typically driven by hand-tuned heuristics or supervised predictors that make myopic decisions from instantaneous local state. We introduce a constrained reinforcement learning (RL) framework that autonomously selects between an implicit BDF integrator (CVODE) and a quasi-steady-state (QSS) solver during chemistry integration. Solver selection is cast as a Markov decision process. The agent learns trajectory-aware policies that account for how present solver choices influence downstream error accumulation, while minimizing computational cost under a user-prescribed accuracy tolerance enforced through a Lagrangian reward with online multiplier adaptation. Across sampled 0D homogeneous reactor conditions, the RL-adaptive policy achieves a mean speedup of approximately 3×, with speedups ranging from 1.11× to 10.58×, while maintaining accurate ignition delays and species profiles for a 106-species n-dodecane mechanism and adding approximately 1% inference overhead. Without retraining, the 0D-trained policy transfers to 1D counterflow diffusion flames over strain rates of 10-2000 s^-1, delivering a consistent speedup of approximately 2.2× relative to CVODE while preserving near-reference temperature accuracy and selecting CVODE at only 12-15% of space-time points. Overall, the results demonstrate the potential of the proposed reinforcement learning framework to learn problem-specific integration strategies while respecting accuracy constraints, thereby opening a pathway toward adaptive, self-optimizing workflows for multiphysics systems with spatially heterogeneous stiffness.
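
A Lagrangian reward with online multiplier adaptation can be sketched in a few lines. The abstract does not give the exact reward, so the form below (cost penalty plus a multiplier on the tolerance violation, with projected dual ascent on the multiplier) is a textbook assumption, and all names are hypothetical.

```python
def lagrangian_reward(cpu_cost, error, tol, lam):
    """Reward = negative cost minus the multiplier times the constraint slack."""
    return -cpu_cost - lam * (error - tol)

def update_multiplier(lam, error, tol, lr=0.1):
    """Dual ascent: raise lam when the tolerance is violated, lower it otherwise.

    The max(0, .) projection keeps the multiplier non-negative.
    """
    return max(0.0, lam + lr * (error - tol))

lam = 0.0
# Hypothetical per-step (cpu_cost, error) pairs with tolerance 0.1.
for cpu_cost, error in [(1.0, 0.5), (1.0, 0.3), (2.0, 0.05)]:
    r = lagrangian_reward(cpu_cost, error, tol=0.1, lam=lam)
    lam = update_multiplier(lam, error, tol=0.1)
    print(round(r, 4), round(lam, 4))
```

When the cheap solver violates the tolerance, the multiplier grows and the penalty eventually outweighs the cost saving, steering the agent back to the accurate solver.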

Certified Set Convergence for Piecewise Affine Systems via Neural Lyapunov Functions

Authors: Yanliang Huang, Peng Xie, Zhen Zhang, Wenyuan Wu, Zhuoqi Zeng, Amr Alanwar
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.00286
Pdf link: https://arxiv.org/pdf/2604.00286
Abstract Safety-critical control of piecewise affine (PWA) systems under bounded additive disturbances requires guarantees not for individual states but for entire state sets simultaneously: a single control action must steer every state in the set toward a target, even as sets crossing mode boundaries split and evolve under distinct affine dynamics. Certifying such set convergence via neural Lyapunov functions couples the Lipschitz constants of the value function and the policy, yet certified bounds for expressive networks exceed true values by orders of magnitude, creating a certification barrier. We resolve this through a three-stage pipeline that decouples verification from the policy. A value function from Hamilton-Jacobi backward reachability, trained via reinforcement learning, is the Lyapunov candidate. A permutation-invariant Deep Sets controller, distilled via regret minimization, produces a common action. Verification propagates zonotopes through the value network, yielding verified Lyapunov upper bounds over entire sets without bounding the policy Lipschitz constant. On four benchmarks up to dimension six, including systems with per-mode operator norms exceeding unity, the framework certifies set convergence with positive margin on every system. A spectrally constrained local certificate completes the terminal guarantee, and the set-actor is the only tested method to achieve full strict set containment, at constant-time online cost.

Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Authors: Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, Ying Nian Wu
Subjects: Computation and Language (cs.CL); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2604.00344
Pdf link: https://arxiv.org/pdf/2604.00344
Abstract Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
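
The key property of QMIX value factorization is that the joint value is monotone in each agent's value, so each agent can act greedily on its own Q-head. The sketch below is a deliberately simplified stand-in (one linear layer with absolute-valued weights, no state-conditioned hypernetwork), not the Agent Q-Mix architecture; names and numbers are assumptions.

```python
import numpy as np

def mix_q_values(agent_qs, w, b):
    """Monotonic mixing: non-negative weights guarantee dQ_tot/dQ_i >= 0."""
    return float(np.abs(np.asarray(w, dtype=float)) @ np.asarray(agent_qs, dtype=float) + b)

w, b = [-0.5, 2.0], 0.1          # raw weights may be negative; abs() fixes the sign
print(mix_q_values([1.0, 2.0], w, b))
print(mix_q_values([1.0, 3.0], w, b))   # raising one agent's Q never lowers Q_tot
```

With this constraint, the joint argmax over communication actions decomposes into per-agent argmaxes, which is what allows decentralized execution after centralized training.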

GUIDE: Reinforcement Learning for Behavioral Action Support in Type 1 Diabetes

Authors: Saman Khamesian, Sri Harini Balaji, Di Yang Shi, Stephanie M. Carpenter, Daniel E. Rivera, W. Bradley Knox, Peter Stone, Hassan Ghasemzadeh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00385
Pdf link: https://arxiv.org/pdf/2604.00385
Abstract Type 1 Diabetes (T1D) management requires continuous adjustment of insulin and lifestyle behaviors to maintain blood glucose within a safe target range. Although automated insulin delivery (AID) systems have improved glycemic outcomes, many patients still fail to achieve recommended clinical targets, warranting new approaches to improve glucose control in patients with T1D. While reinforcement learning (RL) has been utilized as a promising approach, current RL-based methods focus primarily on insulin-only treatment and do not provide behavioral recommendations for glucose control. To address this gap, we propose GUIDE, an RL-based decision-support framework designed to complement AID technologies by providing behavioral recommendations to prevent abnormal glucose events. GUIDE generates structured actions defined by intervention type, magnitude, and timing, including bolus insulin administration and carbohydrate intake events. GUIDE integrates a patient-specific glucose level predictor trained on real-world continuous glucose monitoring data and supports both offline and online RL algorithms within a unified environment. We evaluate both off-policy and on-policy methods across 25 individuals with T1D using standardized glycemic metrics. Among the evaluated approaches, the CQL-BC algorithm demonstrates the highest average time-in-range, reaching 85.49% while maintaining low hypoglycemia exposures. Behavioral similarity analysis further indicates that the learned CQL-BC policy preserves key structural characteristics of patient action patterns, achieving a mean cosine similarity of 0.87 ± 0.09 across subjects. These findings suggest that conservative offline RL with a structured behavioral action space can provide clinically meaningful and behaviorally plausible decision support for personalized diabetes management.
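
The two reported metrics are simple to compute. The sketch below assumes the commonly used 70-180 mg/dL target range for time-in-range (the abstract does not state its thresholds) and treats an action pattern as a plain feature vector; both choices, and the sample numbers, are illustrative assumptions.

```python
import numpy as np

def time_in_range(glucose_mg_dl, low=70.0, high=180.0):
    """Percentage of CGM readings inside [low, high] mg/dL."""
    g = np.asarray(glucose_mg_dl, dtype=float)
    return 100.0 * float(np.mean((g >= low) & (g <= high)))

def cosine_similarity(u, v):
    """Cosine of the angle between two action-pattern vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(time_in_range([60, 100, 150, 200]))            # two of four readings in range
print(cosine_similarity([3.0, 1.0, 0.0], [2.0, 1.0, 1.0]))
```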

Internal State-Based Policy Gradient Methods for Partially Observable Markov Potential Games

Authors: Wonseok Yang, Thinh T. Doan
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00433
Pdf link: https://arxiv.org/pdf/2604.00433
Abstract This letter studies multi-agent reinforcement learning in partially observable Markov potential games. Solving this problem is challenging due to partial observability, decentralized information, and the curse of dimensionality. First, to address the first two challenges, we leverage the common information framework, which allows agents to act based on both shared and local information. Second, to ensure tractability, we study an internal state that compresses accumulated information, preventing it from growing unboundedly over time. We then implement an internal state-based natural policy gradient method to find Nash equilibria of the Markov potential game. Our main contribution is to establish a non-asymptotic convergence bound for this method. Our theoretical bound decomposes into two interpretable components: a statistical error term that also arises in standard Markov potential games, and an approximation error capturing the use of finite-state controllers. Finally, simulations across multiple partially observable environments demonstrate that the proposed method using finite-state controllers achieves consistent improvements in performance compared to the setting where only the current observation is used.

TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

Authors: Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.00438
Pdf link: https://arxiv.org/pdf/2604.00438
Abstract In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground truth during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy for giving reward messages and generating formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at this https URL.
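
The pseudo-labeling step can be sketched as follows: sample several candidate answers, take the majority answer as a stand-in for the missing ground truth, and score each candidate against it. The function names and the 0/1 reward are assumptions; the paper's reward messages and feedback text are not given in the abstract.

```python
from collections import Counter

def pseudo_label(candidates):
    """Majority vote over candidate answers (ties go to the first-seen answer)."""
    return Counter(candidates).most_common(1)[0][0]

def reward_messages(candidates):
    """Pair each candidate with a 0/1 reward against the pseudo-label."""
    label = pseudo_label(candidates)
    return [(c, 1 if c == label else 0) for c in candidates]

samples = ["A", "B", "A", "C"]       # hypothetical answers sampled from the LLM
print(pseudo_label(samples))
print(reward_messages(samples))
```

The quality of this proxy depends on the majority being right more often than not, which is why the reward is an estimate rather than a verified signal.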

Execution-Verified Reinforcement Learning for Optimization Modeling

Authors: Runda Guan, Xiangqing Shen, Jiajun Zhang, Yifan Zhang, Jian Cheng, Rui Xia
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.00442
Pdf link: https://arxiv.org/pdf/2604.00442
Abstract Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision, and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.
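
An outcome-only, execution-verified reward can be sketched as: run the generated program, read the objective value it prints, and compare it with a reference. Everything here is an assumption for illustration, including the convention that the program prints the objective on its last output line; a plain subprocess is also not a real sandbox, which a production harness would need.

```python
import subprocess
import sys

def execution_reward(code, reference_objective, tol=1e-6, timeout=10):
    """Return 1.0 if the program runs and prints the reference objective, else 0.0."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0                      # hung or too slow
    if proc.returncode != 0:
        return 0.0                      # crashed or solver/API error
    try:
        value = float(proc.stdout.strip().splitlines()[-1])
    except (IndexError, ValueError):
        return 0.0                      # no parsable objective value
    scale = max(1.0, abs(reference_objective))
    return 1.0 if abs(value - reference_objective) <= tol * scale else 0.0

# Hypothetical "generated" program standing in for solver-specific model code.
print(execution_reward("print(3 + 4)", reference_objective=7.0))
```

Because the reward depends only on the execution outcome, switching the target solver means changing what the harness runs, not relabeling intermediate reasoning steps.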

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu, Jing Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.00479
Pdf link: https://arxiv.org/pdf/2604.00479
Abstract Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the underlying mechanisms that drive the effectiveness of RL models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models: the former engage in deeper yet narrower reasoning, while base models, despite being less refined along individual paths, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: this https URL

A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

一个基于推理的视觉语言基础模型用于胸部X光解读

Authors: Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00493
Pdf link: https://arxiv.org/pdf/2604.00493
Abstract Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.
中文摘要 胸部X光（CXR）是全球最常进行的影像检查之一，但影像量的增加增加了放射科医生的工作量和诊断错误的风险。尽管人工智能（AI）系统在CXR解读方面展现出潜力，但大多数系统仅生成最终预测，未明确说明视觉证据如何转化为X光学和诊断预测。我们介绍CheXOne，一种基于推理能力的视觉语言模型，用于CXR解读。CheXOne联合生成诊断预测和明确、临床基础的推理痕迹，将视觉证据、X线检查结果与这些预测联系起来。该模型基于1470万个指令和推理样本训练，这些样本来自30个公开数据集，涵盖36个CXR解读任务，采用结合指令调优与强化学习的两阶段框架，以提升推理质量。我们在零样本环境中评估CheXOne，涵盖视觉问答、报告生成、视觉基础和推理评估，涵盖17个评估场景。CheXOne优于现有的医学和普通领域基础模型，并在独立的公共基准测试中取得优异表现。一项临床读者研究表明，CheXOne起草的报告在55%的病例中与住院医师撰写的报告相当或更好，同时有效应对临床适应症，提升报告写作和CXR解读效率。放射科医生的进一步分析显示，生成的推理痕迹具有高度的临床事实性，并为最终预测提供了因果支持，为性能提升提供了合理的解释。这些结果表明，显性推理可以提升模型性能、可解释性以及AI辅助CXR解读中的临床效用。

MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

MOON3.0：推理感知多模态表示学习，用于电子商务产品理解

Authors: Junxian Wu, Chenghan Fu, Zhanheng Nie, Daoze Zhang, Bowen Wan, Wanxian Guan, Chuan Yu, Jian Xu, Bo Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.00513
Pdf link: https://arxiv.org/pdf/2604.00513
Abstract With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
中文摘要 随着电子商务的快速发展，探索通用表征而非特定任务表征的关注度日益增加。尽管近年来多模态大型语言模型（MLLM）推动了产品理解的显著进展，但它们通常被用作特征提取器，隐式地将产品信息编码到全局嵌入中，从而限制了其捕捉细粒度属性的能力。因此，我们认为利用MLLM的推理能力明确建模细粒度产品属性具有显著潜力。然而，由于几个关键挑战，实现这一目标仍然不易：（i）长上下文推理往往稀释模型对原始输入中显著信息的关注;（ii）监督式微调（SFT）主要鼓励僵化模仿，限制有效推理策略的探索;以及（iii）细粒度细节在前向传播过程中逐渐衰减。为解决这些问题，我们提出了MOON3.0，这是首个基于推理感知的MLLM产品表示学习模型。我们的方法（1）采用多头模态融合模块，自适应地整合原始信号;（2）结合对比学习和强化学习框架，自主探索更有效的推理策略;（3）引入细粒度残差增强模块，逐步保留网络中的局部细节。此外，我们还发布了大规模多模式电子商务基准MBE3.0。通过实验，我们的模型在基准和公开数据集上展示了在多个下游任务中最先进的零样本性能。

AceTone: Bridging Words and Colors for Conditional Image Grading

AceTone：连接词语与颜色以实现条件图像调色

Authors: Tianren Ma, Mingxiang Liao, Xijin Zhang, Qixiang Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.00530
Pdf link: https://arxiv.org/pdf/2604.00530
Abstract Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a $3\times32^3$ LUT vector to 64 discrete tokens with $\Delta E<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
中文摘要 颜色影响我们对图像风格和情感的解读。以往的调色（color grading）方法依赖于逐块重新着色或固定滤镜库，难以跨创意意图进行泛化或符合人类审美偏好。本研究提出AceTone，这是首个在统一框架内支持多模态条件调色的方法。AceTone 将调色表述为一种生成式色彩变换任务，模型直接根据文本提示或参考图像生成 3D-LUT。我们开发了基于VQ-VAE的分词器，将$3\times32^3$的LUT向量压缩为64个离散词元（token），保真度为$\Delta E<2$。我们还进一步构建了大规模数据集AceTone-800K，并训练视觉语言模型以预测LUT词元，随后进行强化学习，使输出与感知保真度和美学保持一致。实验显示，AceTone在文本引导和参考图像引导的调色任务中均达到最先进的性能，LPIPS比现有方法提升了多达50%。人工评估证实，AceTone的结果视觉上赏心悦目且风格统一，展示了语言驱动、美学对齐调色的新路径。
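The compression figures quoted in the AceTone abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant names are made up here, and the abstract does not say which ΔE formula is used, so the CIE76 definition (Euclidean distance in Lab space) is an assumption.

```python
import math

# Figures taken from the abstract: a 3 x 32^3 LUT compressed to 64 tokens.
LUT_CHANNELS = 3
LUT_GRID = 32
NUM_TOKENS = 64

lut_values = LUT_CHANNELS * LUT_GRID ** 3   # scalar entries in one LUT
values_per_token = lut_values // NUM_TOKENS  # LUT entries represented per token


def delta_e_cie76(lab_a, lab_b):
    """Colour difference as Euclidean distance in Lab space.

    Assumption: CIE76 is used purely for illustration; the abstract does
    not state which Delta-E variant the authors report.
    """
    return math.dist(lab_a, lab_b)


print(lut_values, values_per_token)
print(delta_e_cie76((50.0, 0.0, 0.0), (51.0, 1.0, 1.0)))
```

Under these assumptions each token stands in for well over a thousand LUT entries, which gives a sense of how aggressive the tokenizer's compression is.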

Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Optimsyn：影响引导评分标准优化合成数据生成

Authors: Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu, Yixin Chen, Jian Wu, Junbo Zhao, Zuozhu Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.00536
Pdf link: https://arxiv.org/pdf/2604.00536
Abstract Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.
中文摘要 大型语言模型（LLMs）在下游表现良好，很大程度上得益于丰富的监督式微调（SFT）数据。然而，在人文、社会科学、医学、法律和金融等知识密集型领域，高质量的SFT数据稀缺，因为专家策展成本高昂，隐私限制严格，标签一致性难以确保。近期工作使用合成数据，通常通过在领域文档上提示生成器，并用手工制作的评分标准过滤输出。然而，评分标准设计依赖专家，跨领域转移能力差，且常通过编写评分标准、综合数据、训练、检查结果和手动猜测修订的脆弱启发式循环进行优化。该过程缺乏关于评分标准如何影响下游性能的可靠定量反馈。我们建议通过合成数据在目标模型上的训练效用来评估合成数据，并以此信号指导数据生成。受影响估计启发，我们采用了一种基于优化器的估计器，利用梯度信息量化每个合成样本对目标模型在特定任务中的贡献。我们的分析显示，即使合成样本和真实样本嵌入空间相近，它们对学习的影响也可能有显著差异。基于这一见解，我们提出了一个基于优化的框架，利用目标模型反馈调整评分标准。我们提供轻量级指导文本，并使用专门的评分标准模型生成任务条件化评分标准。影响分数作为奖励，用于强化学习优化评分标准生成器。跨领域、目标模型和数据生成器的实验显示，在没有针对特定任务调整的情况下，持续的改进和强有力的泛化效果。

Toward Efficient Deployment and Synchronization in Digital Twins-Empowered Networks

迈向数字孪生赋能网络的高效部署与同步

Authors: Hossam Farag, Cedomir Stefanovic
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.00566
Pdf link: https://arxiv.org/pdf/2604.00566
Abstract Digital twins (DTs) are envisioned as a key enabler of the cyber-physical continuum in future wireless networks. However, efficient deployment and synchronization of DTs in dynamic multi-access edge computing (MEC) environments remains challenging due to time-varying communication and computational resources. This paper investigates the joint optimization of DT deployment and synchronization in dynamic MEC environments. A deep reinforcement learning (DRL) framework is proposed for adaptive DT placement and association to minimize interaction latency between physical and digital entities. To ensure semantic freshness, an update scheduling policy is further designed to minimize the long-term weighted sum of the Age of Changed Information (AoCI) and the update cost. A relative policy iteration algorithm with a threshold-based structure is developed to derive the optimal policy. Simulation results show that the proposed methods achieve lower latency, enhanced information freshness, and reduced system cost compared with benchmark schemes.
中文摘要 数字孪生（DT）被设想为未来无线网络中网络物理连续体的关键推动力。然而，由于通信和计算资源时变，动态多址边缘计算（MEC）环境中高效部署和同步DT仍然具有挑战性。本文探讨了动态MEC环境中DT部署与同步的联合优化。提出了一种深度强化学习（DRL）框架，用于自适应DT的放置和关联，以最小化物理实体与数字实体之间的交互延迟。为确保语义新颖，更新调度策略进一步设计以最小化信息变更时代（AoCI）与更新成本的长期加权总和。开发了一个基于阈值结构的相对策略迭代算法，以推导最优策略。仿真结果表明，所提出的方法相比基准方案实现了更低的延迟、增强的信息新鲜度和更低的系统成本

A Physical Imitation Learning Pipeline for Energy-Efficient Quadruped Locomotion Assisted by Parallel Elastic Joint

一个物理模仿学习管道，用于节能四足行走，辅助平行弹性关节

Authors: Huyue Ma, Yurui Jin, Helmut Hauser, Rui Wu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.00611
Pdf link: https://arxiv.org/pdf/2604.00611
Abstract Due to brain-body co-evolution, animals' intrinsic body dynamics play a crucial role in energy-efficient locomotion, which shares control effort between active muscles and passive body dynamics -- a principle known as Embodied Physical Intelligence. In contrast, robot bodies are often designed with one centralised controller that typically suppresses the intrinsic body dynamics instead of exploiting them. We introduce Physical Imitation Learning (PIL), which distils a Reinforcement Learning (RL) control policy into physically implementable body responses that can be directly offloaded to passive Parallel Elastic Joints (PEJs), thereby enabling the body to imitate part of the controlled behaviour. Meanwhile, the residual policy commands the motors to recover the RL policy's performance. The result is an overall reduction in energy consumption thanks to outsourcing parts of the control policy to the PEJs. Here we show, in simulated quadrupeds, that our PIL approach can offload up to 87% of mechanical power to PEJs on flat terrain and 18% on rough terrain. Because the body design is distilled from -- rather than jointly optimised with -- the control policy, PIL realises brain-body co-design without expanding the search space with body design parameters, providing a computationally efficient route to task-specific Embodied Physical Intelligence applicable to a wide range of joint-based robot morphologies.
中文摘要 由于脑-身体的共同进化，动物的内在身体动力学在节能运动中起着关键作用，这种运动在主动肌肉和被动身体动力学之间共享控制力——这一原则被称为具身体智力。相比之下，机器人身体通常设计为一个集中控制装置，通常抑制内在的身体动力学，而非加以利用。我们引入了物理模仿学习（PIL），将强化学习（RL）控制策略提炼为可物理实现的身体反应，这些反应可以直接卸载到被动的平行弹性关节（PEJ），从而使身体能够模仿部分受控行为。与此同时，残余策略命令电机恢复RL策略的性能。其结果是整体能源消耗减少，得益于将部分控制政策外包给PEJ。在这里，我们在模拟四足动物中展示了，我们的PIL方法在平坦地形上可将高达87%的机械动力转移给PEJ，在崎岖地形上可释放18%。由于身体设计是从控制策略中提炼而非联合优化，PIL实现了脑-身体协同设计，而无需扩展身体设计参数的搜索空间，为适用于多种关节型机器人形态提供了高效的任务特定具身物理智能路径。

Full-Gradient Successor Feature Representations

全梯度继任特征表示

Authors: Ritish Shrirao, Aditya Priyadarshi, Raghuram Bharadwaj Diddigi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00686
Pdf link: https://arxiv.org/pdf/2604.00686
Abstract Successor Features (SF) combined with Generalized Policy Improvement (GPI) provide a robust framework for transfer learning in Reinforcement Learning (RL) by decoupling environment dynamics from reward functions. However, standard SF learning methods typically rely on semi-gradient Temporal Difference (TD) updates. When combined with non-linear function approximation, semi-gradient methods lack robust convergence guarantees and can lead to instability, particularly in the multi-task setting where accurate feature estimation is critical for effective GPI. Inspired by Full Gradient DQN, we propose Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), an algorithm that optimizes the successor features by minimizing the full Mean Squared Bellman Error. Unlike standard approaches, our method computes gradients with respect to parameters in both the online and target networks. We provide a theoretical proof of almost-sure convergence for FG-SFRQL and demonstrate empirically that minimizing the full residual leads to superior sample efficiency and transfer performance compared to semi-gradient baselines in both discrete and continuous domains.
中文摘要 继任特征（SF）结合广义策略改进（GPI）为强化学习（RL）中的迁移学习提供了一个稳健的框架，通过将环境动态与奖励函数解耦。然而，标准的SF学习方法通常依赖半梯度时间差（TD）更新。当与非线性函数近似结合时，半梯度方法缺乏稳健的收敛保证，可能导致不稳定性，尤其是在多任务环境中，准确的特征估计对有效的GPI至关重要。受全梯度DQN启发，我们提出了全梯度后继特征表示Q-学习（FG-SFRQL）算法，通过最小化全均方贝尔曼误差来优化后继特征。与标准方法不同，我们的方法在在线和目标网络中都计算参数的梯度。我们为FG-SFRQL提供了几乎确定收敛的理论证明，并通过实证证明，最小化全残差能在离散和连续域中优于半梯度基线，获得更优的样本效率和传输性能。
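The difference between a semi-gradient and a full-gradient update, which the FG-SFRQL abstract turns on, can be sketched on the simplest possible case. The example below is not the paper's algorithm: it assumes a scalar value function with linear features and a single shared weight vector (no separate target network), and every function name is hypothetical. It only shows which term the full gradient of the squared Bellman error adds.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def td_error(w, x, r, x_next, gamma):
    """Bellman residual r + gamma * v(s') - v(s) for a linear v(s) = w . x(s)."""
    return r + gamma * dot(w, x_next) - dot(w, x)


def semi_gradient_step(w, x, r, x_next, gamma, lr):
    """Treats the bootstrapped target as a constant: only v(s) is differentiated."""
    d = td_error(w, x, r, x_next, gamma)
    return [wi + lr * d * xi for wi, xi in zip(w, x)]


def full_gradient_step(w, x, r, x_next, gamma, lr):
    """Gradient descent on 0.5 * residual^2, differentiating v(s) and v(s')."""
    d = td_error(w, x, r, x_next, gamma)
    return [wi + lr * d * (xi - gamma * xni) for wi, xi, xni in zip(w, x, x_next)]


w0 = [0.0, 0.0]
x, x_next = [1.0, 0.0], [0.0, 1.0]
print(semi_gradient_step(w0, x, 1.0, x_next, gamma=0.5, lr=0.1))
print(full_gradient_step(w0, x, 1.0, x_next, gamma=0.5, lr=0.1))
```

In this toy transition the semi-gradient step leaves the weight attached to the next state's feature untouched, while the full-gradient step also moves it, because the next-state estimate is part of the residual being minimized.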

TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

TTA-Vid：视频推理的广义测试时间适应

Authors: Soumya Shamarao Jahagirdar, Edson Araujo, Anna Kukleva, M. Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R. Glass, Hilde Kuehne
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.00696
Pdf link: https://arxiv.org/pdf/2604.00696
Abstract Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to allow for adapting a pretrained model to incoming video samples at test-time without explicit labels. The proposed test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously: (1) a test-time adaptation that performs step-by-step reasoning at inference time on multiple frame subsets. We then use a batch-aware frequency-based reward computed across different frame subsets as pseudo ground truth to update the model. We show that the resulting model, trained on a single batch or even a single sample from a dataset, is able to generalize at test-time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Additionally, we propose a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.
中文摘要 近期的视频推理模型在时间和多模态理解方面取得了显著成效，但它们依赖于大规模监督数据和多阶段训练流程，导致训练成本高且难以适应新领域。本研究利用测试时间强化学习的视频语言数据范式，允许在测试时将预训练模型适应输入视频样本，无需显式标签。所提议的视频测试时间适配方法（TTA-Vid）结合了两个同时工作的组成部分：（1）测试时间适配，在推断时对多个帧子集进行逐步推理。然后我们使用跨不同帧子集计算的批次感知频率奖励作为伪基准值来更新模型。它表明，基于单批甚至单个数据集样本训练的模型，能够在测试时推广到整个数据集甚至跨数据集。由于适应完全发生在测试时，我们的方法不需要任何地面真实注释或专门的训练分段。此外，我们提出了一种多臂强盗策略用于自适应帧选择，学习优先排序信息帧，并以相同的奖励公式为指导。我们的评估显示，TTA-Vid在多种视频推理任务中持续提升，能够超越基于大规模数据训练的当前最先进方法。这凸显了测试时强化学习在时间多模态理解中的潜力。
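One way to picture a frequency-based reward used as pseudo ground truth is as agreement among answers produced from different frame subsets. The sketch below is a guess at the general idea only: it assumes the reward for an answer is its empirical frequency among the sampled answers and that the pseudo label is the most frequent answer. The abstract does not give the exact formulation (in particular the "batch-aware" part), and both function names are hypothetical.

```python
from collections import Counter


def frequency_rewards(answers):
    """Reward each sampled answer by how often it occurs among all samples.

    Assumption: plain empirical frequency, with no batch-level weighting.
    """
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


def pseudo_label(answers):
    """Most frequent answer, used in place of a ground-truth label."""
    return Counter(answers).most_common(1)[0][0]


# Answers to one multiple-choice question from four different frame subsets.
sampled = ["B", "B", "A", "B"]
print(pseudo_label(sampled))
print(frequency_rewards(sampled))
```

No label is consulted at any point, which is what lets this kind of signal drive adaptation purely at test time.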

Learning to Hint for Reinforcement Learning

学习提示以促进强化学习

Authors: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.00698
Pdf link: https://arxiv.org/pdf/2604.00698
Abstract Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at this https URL.
中文摘要 组相对策略优化（GRPO）被广泛用于具有可验证奖励的强化学习，但它常常存在优势崩溃的问题：当一个组中所有采样轨迹（rollout）都获得相同奖励时，该组的相对优势为零，因此没有学习信号。例如，如果问题对推理者来说过于困难，所有采样的rollout都可能错误并得到零奖励。近期研究通过在这些难题中添加提示或辅助支架来解决这个问题，使推理者产生混合结果并恢复非零的更新。然而，现有提示通常是固定的，而非适应当前推理者，且在带提示输入下产生学习信号的提示不一定能改善测试时使用的无提示策略。为此，我们提出了面向强化学习的提示学习（HiLL），这是一个在强化学习中联合训练提示者（hinter）策略和推理者策略的框架。对于每个难题，提示者会根据当前推理者的错误rollout在线生成提示，使提示生成能够适应推理者不断变化的错误。我们进一步引入了提示依赖度，衡量正确的带提示轨迹对提示的依赖程度。我们推导出一个可迁移性结果，表明较低的提示依赖度意味着从有提示成功到无提示成功的迁移更强，并利用该结果定义了用于训练提示者的迁移加权奖励。因此，HiLL偏好那些不仅能恢复有信息量的GRPO组，还能产生更有可能改善原始无提示策略的信号的提示。跨多个基准测试的实验表明，HiLL始终优于GRPO和以往基于提示的基线，证明了自适应和迁移感知提示学习对强化学习的价值。代码可在该 https URL 访问。
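The advantage collapse described in the HiLL abstract is easy to reproduce numerically. The sketch below uses the commonly cited GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` stabilizer are choices made here for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    eps is an illustrative stabilizer that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A question that is too hard: every rollout is wrong, so every reward is 0.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# The same question with a helpful hint: outcomes are mixed.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

print(collapsed)
print(mixed)
```

With identical rewards every advantage is exactly zero, so the group contributes no gradient; once a hint produces mixed outcomes, correct rollouts get positive advantages and incorrect ones negative, which is the learning signal that hint-based methods try to recover.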

LangMARL: Natural Language Multi-Agent Reinforcement Learning

LangMARL：自然语言多智能体强化学习

Authors: Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, Hua Wei
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.00722
Pdf link: https://arxiv.org/pdf/2604.00722
Abstract Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi-agent credit assignment problem, which has long been studied in classical multi-agent reinforcement learning (MARL) but remains underaddressed in LLM-based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent-level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi-agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.
中文摘要 大型语言模型（LLM）智能体在动态环境中难以自主演化协调策略，主要原因是粗粒度的全局结果掩盖了局部策略细化所需的因果信号。我们将这一瓶颈识别为多智能体信用分配问题，这一问题长期以来在经典多智能体强化学习（MARL）中被研究，但在基于LLM的系统中仍未得到充分解决。基于这一观察，我们提出了LangMARL，这一框架将信用分配和策略梯度演化从合作式MARL带入语言空间。LangMARL引入了智能体级的语言信用分配，开创了在语言空间中进行梯度演化以改进策略，并从重放轨迹中总结任务相关的因果关系，以提供密集反馈并改善稀疏奖励下的收敛性。在多样化的协作多智能体任务中进行的大量实验展示了更高的样本效率、更好的可解释性和强大的泛化能力。

RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning

RefineRL：通过自我完善强化学习推进竞技编程

Authors: Shaopeng Fu, Xingxing Zhang, Li Dong, Di Wang, Furu Wei
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.00790
Pdf link: https://arxiv.org/pdf/2604.00790
Abstract While large language models (LLMs) have demonstrated strong performance on complex reasoning tasks such as competitive programming (CP), existing methods predominantly focus on single-attempt settings, overlooking their capacity for iterative refinement. In this paper, we present RefineRL, a novel approach designed to unleash the self-refinement capabilities of LLMs for CP problem solving. RefineRL introduces two key innovations: (1) Skeptical-Agent, an iterative self-refinement agent equipped with local execution tools to validate generated solutions against public test cases of CP problems. This agent always maintains a skeptical attitude towards its own outputs and thereby enforces rigorous self-refinement even when validation suggests correctness. (2) A reinforcement learning (RL) solution to incentivize LLMs to self-refine with only standard RLVR data (i.e., problems paired with their verifiable answers). Extensive experiments on Qwen3-4B and Qwen3-4B-2507 demonstrate that our method yields substantial gains: after our RL training, these compact 4B models integrated with the Skeptical-Agent not only outperform much larger 32B models but also approach the single-attempt performance of 235B models. These findings suggest that self-refinement holds considerable promise for scaling LLM reasoning, with significant potential for further advancement.
中文摘要 虽然大型语言模型（LLMs）在竞争性编程（CP）等复杂推理任务中表现出色，但现有方法主要关注单次尝试设置，忽视了其迭代优化的能力。本文介绍了RefineRL，一种旨在释放LLM自我精炼能力以解决CP问题的新方法。RefineRL 引入了两项关键创新：（1） Skeptical-Agent，一款迭代自精炼代理，配备本地执行工具，用于验证生成的解决方案与 CP 问题的公开测试案例。该代理始终对自身输出保持怀疑态度，因此即使验证表明正确，也强制执行严格的自我完善。（2）强化学习（RL）解决方案，激励LLM仅用标准RLVR数据进行自我精炼（即问题与其可验证答案的匹配）。对Qwen3-4B和Qwen3-4B-2507的广泛实验表明，我们的方法取得了显著提升：经过强化学习训练后，这些与怀疑代理集成的紧凑型4B模型不仅优于更大的32B模型，还接近235B模型的单次尝试性能。这些发现表明自我完善在扩展大型语言模型推理方面具有巨大潜力，并具有进一步发展的巨大潜力。

Bridging RL and MPC for mixed-integer optimal control with application to Formula 1 race strategies

将强化学习（RL）和MPC桥接，实现混合整数最优控制，并应用于一级方程式赛车策略

Authors: Joschua Wüthrich, Romir Damle, Giona Fieni, Melanie N. Zeilinger, Christopher H. Onder, Andrea Carron
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.00826
Pdf link: https://arxiv.org/pdf/2604.00826
Abstract We propose a hybrid reinforcement learning (RL) and model predictive control (MPC) framework for mixed-integer optimal control, where discrete variables enter the cost and dynamics but not the constraints. Existing hierarchical approaches use RL only for the discrete action space, leaving continuous optimization to MPC. Unlike these methods, we train the RL agent on the full hybrid action space, ensuring consistency with the cost of the underlying Markov decision process. During deployment, the RL actor is rolled out over the prediction horizon to parametrize an integer-free nonlinear MPC through the discrete action sequence and provide a continuous warm-start. The learned critic serves as a terminal cost to capture long-term performance. We prove recursive feasibility, and validate the framework on a Formula 1 race strategy problem. The hybrid method achieves near-optimal performance relative to an offline mixed-integer nonlinear program benchmark, outperforming a standalone RL agent. Moreover, the hybrid scheme enables adaptation to unseen disturbances through modular MPC extensions at zero retraining cost.
中文摘要 我们提出了一种混合强化学习（RL）和模型预测控制（MPC）框架，用于混合整数最优控制，其中离散变量出现在代价函数和动力学中，但不出现在约束中。现有的分层方法仅在离散动作空间中使用强化学习，连续优化则交由MPC完成。与这些方法不同，我们在完整的混合动作空间上训练强化学习智能体，确保与底层马尔可夫决策过程的代价一致。部署过程中，强化学习的actor在预测时域上展开（roll out），通过离散动作序列参数化无整数的非线性MPC，并提供连续变量的热启动。学习得到的critic作为终端代价，用于刻画长期性能。我们证明了递归可行性，并在一级方程式比赛策略问题上验证了该框架。混合方法相对于离线混合整数非线性规划基准实现了近乎最优的性能，优于独立的强化学习智能体。此外，混合方案通过模块化MPC扩展实现对未见扰动的适应，且无需任何再训练成本。

Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation

解开纠缠与重新耦合：解决主体驱动文本到图像生成中的相似性-可控悖论

Authors: Shuang Li, Chao Deng, Hang Chen, Liqun Liu, Zhenyu Hu, Te Cao, Mengge Xue, Yuan Chen, Peng Shu, Huan Yu, Jie Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.00849
Pdf link: https://arxiv.org/pdf/2604.00849
Abstract Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject's identity while editing its context based on a text prompt. A core challenge in this task is the "similarity-controllability paradox", where enhancing textual control often degrades the subject's fidelity, and vice versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, where the subject is referred to by a general pronoun, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.
中文摘要 主语驱动文本转图像（T2I）生成旨在根据文本提示编辑主体的身份，同时编辑其上下文。这项任务的核心挑战是“相似性-可控性悖论”，即增强文本控制常常降低主体的真实度，反之亦然。我们认为，这一悖论源于文本提示的模糊角色，这些提示通常同时负责描述主题和期望的修改，导致模型收到相互矛盾的信号。为解决这个问题，我们提出了DisCo，这是一个新颖的框架，先是Disntangles，然后再重新耦合视觉信息和文本信息。首先，我们的文本-视觉解耦模块分离信息来源：主语身份仅从带有主语实体词的参考图像中提取，文本提示简化为仅包含修改命令，主语指代通用代词，消除描述歧义。然而，这种严格的分离可能导致主体与其语境之间不自然的组合。我们通过设计专用奖励信号并利用强化学习，无缝将视觉定义的主体与文本生成的上下文重新耦合来解决这个问题。我们的方法有效解决了这一悖论，实现了高保真主题保存和精确文本控制。大量实验表明，我们的方法实现了最先进的性能，能够生成高度真实且连贯的图像。

Policy Improvement Reinforcement Learning

策略改进强化学习

Authors: Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.00860
Pdf link: https://arxiv.org/pdf/2604.00860
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的核心训练后范式。然而，现有方法有一个共同的盲点：它们基于即时的组级或批级统计数据优化策略，却从未验证最终更新是否真正改善了模型。这种开环设计——每一步单独更新，仅由组内（批）奖励信号引导——意味着优化可能漂移或崩溃，且没有机制去检测和纠正这些失败。我们认为缺失的要素是政策改进反馈：直接衡量和优化迭代进展的能力。为此，我们引入了策略改进强化学习（PIRL），这是一个框架，用最大化迭代累积策略改进的明确目标取代了替代奖励最大化，并证明该时间目标与最大化最终任务绩效完美契合。基于PIRL，我们提出了政策改进政策优化（PIPO），通过回顾性验证实现闭环优化。每次迭代时，PIPO评估上一次更新是否在滑动窗口历史基线下带来真实改进，然后积极强化有益更新并抑制有害更新——将开环过程转变为自我修正过程。我们提供了理论分析，表明PIPO在预期中实现了PIRL目标的上升，数学推理基准测试显示其稳定性和性能优于GRPO及其变体。
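The retrospective check in the PIPO abstract, comparing the latest result against a sliding-window historical baseline, can be pictured with a small bookkeeping class. This is only a sketch of that one idea under assumptions made here: the class name, the window size, and the use of a plain mean over recent scores are all hypothetical, and the abstract does not specify how the improvement signal enters the policy update.

```python
from collections import deque


class SlidingWindowBaseline:
    """Tracks recent evaluation scores and reports improvement over their mean."""

    def __init__(self, window=4):
        # Only the most recent `window` scores are kept as the baseline.
        self.history = deque(maxlen=window)

    def improvement(self, score):
        """Return score minus the mean of the stored history, then record it.

        With an empty history the improvement is defined as 0.0.
        """
        if self.history:
            baseline = sum(self.history) / len(self.history)
        else:
            baseline = score
        self.history.append(score)
        return score - baseline


tracker = SlidingWindowBaseline(window=2)
for step_score in [0.5, 0.7, 0.4, 0.6]:
    delta = tracker.improvement(step_score)
    # A positive delta would mark the previous update as beneficial,
    # a negative delta as harmful.
    print(step_score, delta)
```

A closed-loop trainer could use the sign of `delta` to decide whether to reinforce or suppress the previous update, in the spirit of the self-correcting behaviour the abstract describes.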

Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

基于流的策略结合分布强化学习在轨迹优化中的应用

Authors: Ruijie Hao, Longfei Zhang, Yang Dai, Yang Ma, Xingxing Liang, Guangquan Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.00977
Pdf link: https://arxiv.org/pdf/2604.00977
Abstract Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution, which prevents the policy from capturing multimodal distributions and makes it difficult to cover the full range of optimal solutions in multi-solution problems. In addition, the return is reduced to a mean value, losing its multimodal nature and thus providing insufficient guidance for policy updates. In response to these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). This algorithm models the policy using flow matching, which offers both computational efficiency and the capacity to fit complex distributions. Additionally, it employs a distributional RL approach to model and optimize the entire return distribution, thereby more effectively guiding multimodal policy updates and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that the FP-DRL algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting superior representation capability of the flow policy.
中文摘要 强化学习（RL）在处理复杂的控制和决策任务方面已被证明非常有效。然而，在大多数传统强化学习算法中，策略通常被参数化为对角高斯分布，这限制了策略对多模态分布的刻画，使其在多解问题中难以覆盖全部最优解；同时，回报被简化为均值，失去了多模态特性，从而无法为策略更新提供足够的指导。针对这些问题，我们提出了一种强化学习算法，称为结合分布强化学习的基于流的策略（FP-DRL）。该算法使用流匹配来建模策略，既具有计算效率，又具备拟合复杂分布的能力。此外，它采用分布强化学习方法来建模和优化整个回报分布，从而更有效地引导多模态策略更新并提升智能体性能。在MuJoCo基准上的实验表明，FP-DRL算法在大多数MuJoCo控制任务中达到了最先进（SOTA）的性能，同时展现出流策略更强的表示能力。
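
To make "modeling the policy using flow matching" concrete, here is a minimal sketch of one common flow-matching construction: a straight-line path between a noise sample and a target action, with the constant velocity `x1 - x0` as the regression target. The abstract does not say which path or solver FP-DRL uses, so the linear path, the Euler integrator, and the names `flow_matching_pair` and `euler_sample` are assumptions for illustration only.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus the target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, action = [0.0, 1.0], [1.0, -1.0]
x_half, target_v = flow_matching_pair(noise, action, 0.5)
# A learned network would replace this lambda; here the exact velocity is used.
sampled = euler_sample(lambda x, t: target_v, noise, steps=10)
```

With the exact velocity field, integrating from the noise sample recovers the target action; a trained policy would instead use a state-conditioned network as `velocity_fn`.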

Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

基于MLLM的长视频理解中的查询条件证据关键帧采样

Authors: Yiheng Wang, Lichen Zhu, Yueqian Lin, Yudong Liu, Jingyang Zhang, Hai "Helen" Li, Yiran Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.01002
Pdf link: https://arxiv.org/pdf/2604.01002
Abstract Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
中文摘要 多模态大型语言模型（MLLMs）在视频问答方面表现出色，但其在长视频中的应用受限于有限的上下文长度和计算成本，因此关键帧采样至关重要。现有方法通常依赖语义相关性或强化学习，这些方法要么无法捕捉证据线索，要么存在组合优化效率低下的问题。本研究提出了一个基于信息瓶颈理论的证据驱动关键帧采样框架。我们将关键帧选择表述为最大化所选帧与查询之间的条件互信息，提供了一个有原则的目标，反映每一帧对回答问题的贡献。为了使该目标可解，我们利用其结构推导出一种分解优化，将子集选择简化为独立的帧级评分。我们还引入了一个以对比目标训练的查询条件证据评分网络，以高效估计证据重要性。在长视频理解基准上的实验表明，在严格的token预算下，我们的方法始终优于以往的采样策略，同时显著提升了训练效率。
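
Once subset selection is reduced to independent frame-level scoring, picking keyframes under a budget becomes a top-k selection. The sketch below illustrates that reduction only. The dot-product scorer stands in for the paper's learned evidence scoring network, and `score_frames`, `select_keyframes`, and the toy features are hypothetical.

```python
from typing import List, Sequence


def score_frames(frame_feats: Sequence[Sequence[float]],
                 query_feat: Sequence[float]) -> List[float]:
    """Stand-in scorer: dot product between each frame feature and the query feature."""
    return [sum(f * q for f, q in zip(feat, query_feat)) for feat in frame_feats]


def select_keyframes(scores: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` highest-scoring frames and return them in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


frames = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.8], [0.4, 0.4]]
query = [1.0, 0.5]
scores = score_frames(frames, query)
keyframes = select_keyframes(scores, budget=2)
```

Because each frame is scored on its own, selection costs one sort rather than a search over frame subsets, which is the efficiency argument the abstract makes.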

Adversarial Attacks in AI-Driven RAN Slicing: SLA Violations and Recovery

AI驱动的RAN切片中的对抗性攻击：SLA违规与恢复

Authors: Deemah H. Tashman, Soumaya Cherkaoui
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.01049
Pdf link: https://arxiv.org/pdf/2604.01049
Abstract Next-generation (NextG) cellular networks are designed to support emerging applications with diverse data rate and latency requirements, such as immersive multimedia services and large-scale Internet of Things deployments. A key enabling mechanism is radio access network (RAN) slicing, which dynamically partitions radio resources into virtual resource blocks to efficiently serve heterogeneous traffic classes, including enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC). In this paper, we study the impact of adversarial attacks on AI-driven RAN slicing decisions, where a budget-constrained adversary selectively jams slice transmissions to bias deep reinforcement learning (DRL)-based resource allocation, and quantify the resulting service level agreement (SLA) violations and post-attack recovery behavior. Our results indicate that budget-constrained adversarial jamming can induce severe and slice-dependent steady-state SLA violations. Moreover, the DRL agent's reward converges toward the clean baseline only after a non-negligible recovery period.
中文摘要 下一代（NextG）蜂窝网络旨在支持具有不同数据速率和延迟需求的新兴应用，如沉浸式多媒体服务和大规模物联网部署。一个关键的使能机制是无线接入网络（RAN）切片，它动态地将无线资源划分为虚拟资源块，以高效服务异构流量类别，包括增强型移动宽带（eMBB）、大规模机器类通信（mMTC）和超可靠低延迟通信（URLLC）。本文研究了对抗性攻击对AI驱动的RAN切片决策的影响：预算受限的攻击者选择性地干扰切片传输，使基于深度强化学习（DRL）的资源分配产生偏差；我们量化了由此产生的服务水平协议（SLA）违规以及攻击后的恢复行为。我们的结果表明，预算受限的对抗性干扰可能导致严重且依赖于切片的稳态SLA违规。此外，DRL智能体的奖励只有在经过不可忽视的恢复期后才会收敛到无攻击基线。

BAT: Balancing Agility and Stability via Online Policy Switching for Long-Horizon Whole-Body Humanoid Control

BAT：通过在线策略切换平衡敏捷性与稳定性，实现长时域全身人形机器人控制

Authors: Donghoon Baek, Sang-Hun Kim, Sehoon Ha
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.01064
Pdf link: https://arxiv.org/pdf/2604.01064
Abstract Despite recent advances in control, reinforcement learning, and imitation learning, developing a unified framework that can achieve agile, precise, and robust whole-body behaviors, particularly in long-horizon tasks, remains challenging. Existing approaches typically follow two paradigms: coupled whole-body policies for global coordination and decoupled policies for modular precision. However, without a systematic method to integrate both, this trade-off between agility, robustness, and precision remains unresolved. In this work, we propose BAT, an online policy-switching framework that dynamically selects between two complementary whole-body RL controllers to balance agility and stability across different motion contexts. Our framework consists of two complementary modules: a switching policy learned via hierarchical RL with expert guidance from sliding-horizon policy pre-evaluation, and an option-aware VQ-VAE that predicts option preference from discrete motion token sequences for improved generalization. The final decision is obtained via confidence-weighted fusion of the two modules. Extensive simulations and real-world experiments on the Unitree G1 humanoid robot demonstrate that BAT enables versatile long-horizon loco-manipulation and outperforms prior methods across diverse tasks.
中文摘要 尽管控制、强化学习和模仿学习近年来取得了进展，但开发一个能够实现敏捷、精准且稳健的全身行为的统一框架，尤其是在长时域任务中，仍然具有挑战性。现有方法通常遵循两种范式：用于全局协调的耦合全身策略和用于模块化精度的解耦策略。然而，由于缺乏系统化的方法来整合两者，敏捷性、稳健性和精准度之间的权衡依然未被解决。在本研究中，我们提出了BAT，一种在线策略切换框架，能够在两个互补的全身强化学习控制器之间动态选择，以在不同运动情境中平衡敏捷性和稳定性。我们的框架由两个互补模块组成：一个通过分层强化学习学得的切换策略，并由滑动时域策略预评估提供专家指导；以及一个选项感知的VQ-VAE，它根据离散运动token序列预测选项偏好，以提升泛化能力。最终决策通过对两个模块进行置信度加权融合得出。在Unitree G1人形机器人上进行的大量仿真和真实实验表明，BAT能够实现多样的长时域移动操作，并在多种任务中优于以往方法。
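
The abstract says the final decision comes from a confidence-weighted fusion of the two modules but gives no formula. One plausible reading, shown here purely as an assumption, is a convex combination of the two modules' option probabilities with weights proportional to their confidences. The function name `fuse_option_preferences` and all numbers are hypothetical.

```python
def fuse_option_preferences(p_switch, conf_switch, p_vqvae, conf_vqvae):
    """Confidence-weighted average of two probability vectors over controller options."""
    total = conf_switch + conf_vqvae
    if total <= 0:
        raise ValueError("at least one module must report positive confidence")
    w = conf_switch / total  # weight given to the switching policy
    fused = [w * a + (1.0 - w) * b for a, b in zip(p_switch, p_vqvae)]
    # Index of the option (controller) with the highest fused probability.
    choice = max(range(len(fused)), key=fused.__getitem__)
    return fused, choice


# Option 0: agile controller, option 1: stable controller (illustrative labels).
fused, choice = fuse_option_preferences([0.7, 0.3], 0.5, [0.2, 0.8], 1.5)
```

Here the more confident module dominates, so the fused preference selects option 1 even though the switching policy alone would have picked option 0.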

Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense

SDN-IoT 防御中安全两时间尺度强化学习的多智能体大型语言模型治理

Authors: Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung, Mohammad Hamdaqa
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.01127
Pdf link: https://arxiv.org/pdf/2604.01127
Abstract Software-Defined Networking (SDN) is increasingly adopted to secure Internet-of-Things (IoT) networks due to its centralized control and programmable forwarding. However, SDN-IoT defense is inherently a closed-loop control problem in which mitigation actions impact controller workload, queue dynamics, rule-installation delay, and future traffic observations. Aggressive mitigation may destabilize the control plane, degrade Quality of Service (QoS), and amplify systemic risk. Existing learning-based approaches prioritize detection accuracy while neglecting controller coupling and short-horizon Reinforcement Learning (RL) optimization without structured, auditable policy evolution. This paper introduces a self-reflective two-timescale SDN-IoT defense solution separating fast mitigation from slow policy governance. At the fast timescale, per-switch Proximal Policy Optimization (PPO) agents perform controller-aware mitigation under safety constraints and action masking. At the slow timescale, a multi-agent Large Language Model (LLM) governance engine generates machine-parsable updates to the global policy constitution Pi, which encodes admissible actions, safety thresholds, and reward priorities. Updates (Delta Pi) are validated through stress testing and deployed only with non-regression and safety guarantees, ensuring an auditable evolution without retraining RL agents. Evaluation under heterogeneous IoT traffic and adversarial stress shows improvements of 9.1% Macro-F1 over PPO and 15.4% over static baselines. Worst-case degradation drops by 36.8%, controller backlog peaks by 42.7%, and RTT p95 inflation remains below 5.8% under high-intensity attacks. Policy evolution converges within five cycles, reducing catastrophic overload from 11.6% to 2.3%.
中文摘要 软件定义网络（SDN）因其集中控制和可编程转发，越来越多地被用于保障物联网（IoT）网络的安全。然而，SDN-IoT防御本质上是一个闭环控制问题，缓解措施会影响控制器工作负载、队列动态、规则安装延迟以及未来的流量观测。激进的缓解可能会破坏控制平面的稳定，降低服务质量（QoS），并放大系统性风险。现有基于学习的方法优先考虑检测准确性，忽视了控制器耦合和短视野强化学习（RL）优化，缺乏结构化、可审计的策略演进。本文提出了一种自我反思的两时间尺度SDN-IoT防御方案，将快速缓解与慢速策略治理分离。在快时间尺度上，每个交换机的近端策略优化（PPO）智能体在安全约束和动作掩码下执行控制器感知的缓解。在慢时间尺度上，一个多智能体大型语言模型（LLM）治理引擎为全局策略宪章Pi生成机器可解析的更新，该宪章编码了可允许的动作、安全阈值和奖励优先级。更新（Delta Pi）通过压力测试进行验证，并且仅在满足非回归和安全保证时才部署，从而确保可审计的演进，而无需重新训练强化学习智能体。在异构物联网流量和对抗压力下的评估显示，Macro-F1比PPO提升了9.1%，比静态基线提升了15.4%。最坏情况下的性能退化降低了36.8%，控制器积压峰值降低了42.7%，在高强度攻击下RTT p95的增幅保持在5.8%以下。策略演进在五个周期内收敛，将灾难性过载从11.6%降至2.3%。
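
Action masking, mentioned for the fast-timescale PPO agents, is commonly implemented by removing disallowed actions before the softmax so that they receive zero probability. The sketch below shows that generic technique. It is not taken from the paper, and the function `masked_softmax` and the three example actions are assumptions.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over logits with disallowed actions forced to probability zero."""
    if not any(allowed):
        raise ValueError("mask must allow at least one action")
    # Subtract the largest allowed logit for numerical stability.
    m = max(l for l, ok in zip(logits, allowed) if ok)
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, allowed)]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical actions: 0 = drop all traffic, 1 = rate-limit, 2 = monitor only.
# The mask forbids action 0, e.g. because it would violate a safety threshold.
probs = masked_softmax([2.0, 1.0, 0.5], [False, True, True])
```

Even though the forbidden action has the largest logit, it can never be sampled, and the remaining probability mass is renormalized over the admissible actions.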

Deep Reinforcement Learning for Robotic Manipulation under Distribution Shift with Bounded Extremum Seeking

基于有界极值搜索的分布偏移下机器人操作的深度强化学习

Authors: Shaifalee Saxena, Rafael Fierro, Alexander Scheinker
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.01142
Pdf link: https://arxiv.org/pdf/2604.01142
Abstract Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at inference time. In this paper, we investigate a hybrid controller that combines reinforcement learning with bounded extremum seeking to improve robustness under such conditions. In the proposed approach, deep deterministic policy gradient (DDPG) policies are trained under standard conditions on the robotic pushing and pick-and-place tasks, and are then combined with bounded ES during deployment. The RL policy provides fast manipulation behavior, while bounded ES ensures robustness of the overall controller to time variations when operating conditions depart from those seen during training. The resulting controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.
中文摘要 强化学习在机器人操作中表现优异，但当测试条件与训练分布不同时，学到的策略性能常常下降。这一限制在推动和抓取放置等接触密集的任务中尤为重要，因为目标、接触条件或机器人动力学的变化可能导致系统在推理时偏离训练分布。本文研究了一种结合强化学习与有界极值搜索（ES）的混合控制器，以提升在此类条件下的鲁棒性。在所提方法中，深度确定性策略梯度（DDPG）策略在标准条件下针对机器人推动和抓取放置任务进行训练，然后在部署时与有界ES结合。强化学习策略提供快速的操作行为，而有界ES确保当运行条件偏离训练时所见条件时，整体控制器对时间变化具有鲁棒性。最终的控制器在多种分布外设置下进行评估，包括时变目标和空间变化的摩擦区域。

Embarrassingly Simple Self-Distillation Improves Code Generation

令人尴尬的简单自蒸馏提升了代码生成

Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.01193
Pdf link: https://arxiv.org/pdf/2604.01193
Abstract Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
中文摘要 大型语言模型（LLM）能否仅凭自身的原始输出提升代码生成能力，而无需验证器、教师模型或强化学习？我们用简单自蒸馏（SSD）给出了肯定的回答：以特定的温度和截断配置从模型中采样解答，然后用标准监督微调在这些样本上进行微调。SSD在LiveCodeBench v6上将Qwen3-30B-Instruct的pass@1从42.4%提升到55.3%，提升集中在更难的问题上，并且在4B、8B和30B规模的Qwen和Llama模型上均有效，包括instruct和thinking两种变体。为了理解为何如此简单的方法能奏效，我们将这些提升追溯到大型语言模型解码中的精度与探索冲突，并表明SSD以依赖上下文的方式重塑token分布，在精度关键处抑制干扰性的尾部，同时在探索重要处保留有用的多样性。综合来看，SSD为提升LLM代码生成提供了一个互补的后训练方向。
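
The sampling step in SSD relies on "temperature and truncation configurations" that the abstract does not spell out. As a hedged illustration, the sketch below applies temperature scaling followed by nucleus (top-p) truncation to a toy logit vector. Top-p is only one possible truncation scheme, and `truncated_distribution`, `sample_token`, and the parameter values are assumptions rather than the paper's settings.

```python
import math
import random


def truncated_distribution(logits, temperature=1.0, top_p=1.0):
    """Temperature-scaled softmax, then top-p truncation and renormalization."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


def sample_token(logits, temperature, top_p, rng):
    """Draw one token index from the truncated distribution."""
    dist = truncated_distribution(logits, temperature, top_p)
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.0, -1.0]
dist = truncated_distribution(toy_logits, temperature=1.0, top_p=0.9)
token = sample_token(toy_logits, 1.0, 0.9, rng)
```

In this example the least likely token is cut from the tail and can never be sampled. Samples drawn this way would then serve as the fine-tuning data in a self-distillation loop.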

Keyword: diffusion policy

There is no result