Generated: 2026-03-16 17:00:27 (UTC+8); arXiv publication time: 2026-03-16 20:00 EDT (2026-03-17 08:00 UTC+8)
28 relevant papers today
Keyword: reinforcement learning
Thermodynamics of Reinforcement Learning Curricula
强化学习课程的热力学
- Authors: Jacob Adamczyk, Juan Sebastian Rojas, Rahul V. Kulkarni
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12324
- Pdf link: https://arxiv.org/pdf/2603.12324
- Abstract
Connections between statistical mechanics and machine learning have repeatedly proven fruitful, providing insight into optimization, generalization, and representation learning. In this work, we follow this tradition by leveraging results from non-equilibrium thermodynamics to formalize curriculum learning in reinforcement learning (RL). In particular, we propose a geometric framework for RL by interpreting reward parameters as coordinates on a task manifold. We show that, by minimizing the excess thermodynamic work, optimal curricula correspond to geodesics in this task space. As an application of this framework, we provide an algorithm, "MEW" (Minimum Excess Work), to derive a principled schedule for temperature annealing in maximum-entropy RL.
- 中文摘要
统计力学与机器学习之间的联系屡次被证明富有成效,为优化、泛化和表征学习提供了见解。在本研究中,我们沿袭这一传统,利用非平衡热力学的结果,形式化强化学习(RL)中的课程学习。特别地,我们将奖励参数解释为任务流形上的坐标,由此提出一个强化学习的几何框架。我们证明,在最小化过剩热力学功的意义下,最优课程对应于该任务空间中的测地线。作为该框架的应用,我们给出算法“MEW”(Minimum Excess Work,最小过剩功),用于为最大熵强化学习中的温度退火推导有原则的调度方案。
Maximum Entropy Exploration Without the Rollouts
无需 Rollout 的最大熵探索
- Authors: Jacob Adamczyk, Adam Kamoski, Rahul V. Kulkarni
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12325
- Pdf link: https://arxiv.org/pdf/2603.12325
- Abstract
Efficient exploration remains a central challenge in reinforcement learning, serving as a useful pretraining objective for data collection, particularly when an external reward function is unavailable. A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution, thereby encouraging uniform long-run coverage of the state space. Many existing exploration approaches require estimating state visitation frequencies through repeated on-policy rollouts, which can be computationally expensive. In this work, we instead consider an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. An entropy-regularized version of this objective admits a spectral characterization: the relevant stationary distributions can be computed from the dominant eigenvectors of a problem-dependent transition matrix. This insight leads to a novel algorithm for solving the maximum entropy exploration problem, EVE (EigenVector-based Exploration), which avoids explicit rollouts and distribution estimation, instead computing the solution through iterative updates, similar to a value-based approach. To address the original unregularized objective, we employ a posterior-policy iteration (PPI) approach, which monotonically improves the entropy and converges in value. We prove convergence of EVE under standard assumptions and demonstrate empirically that it efficiently produces policies with high steady-state entropy, achieving competitive exploration performance relative to rollout-based baselines in deterministic grid-world environments.
- 中文摘要
高效探索仍然是强化学习的核心挑战,它可以作为数据收集的有用预训练目标,尤其是在外部奖励函数不可用时。探索问题的一种有原则的表述是:寻找使其诱导的稳态访问分布的熵最大的策略,从而促使对状态空间的长期覆盖趋于均匀。许多现有探索方法需要通过反复的同策略(on-policy)rollout 来估计状态访问频率,计算开销可能很大。在本研究中,我们转而考虑一种内在平均奖励表述,其中奖励源自访问分布本身,使最优策略最大化稳态熵。该目标的熵正则化版本具有谱刻画:相关的平稳分布可以由一个依赖于问题的转移矩阵的主特征向量计算得到。这一见解催生了求解最大熵探索问题的新算法 EVE(EigenVector-based Exploration,基于特征向量的探索),它避免了显式的 rollout 和分布估计,而是像基于价值的方法那样通过迭代更新计算解。为了处理原始的非正则化目标,我们采用后验策略迭代(PPI)方法,它单调地提升熵并在价值上收敛。我们在标准假设下证明了 EVE 的收敛性,并通过实验表明它能高效地产生具有高稳态熵的策略,在确定性网格世界环境中取得了与基于 rollout 的基线相当的探索性能。
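The steady-state entropy objective described in the abstract above can be made concrete with a small sketch. This is not the EVE algorithm itself; it only illustrates, for a fixed row-stochastic transition matrix, how a stationary distribution arises as a dominant (eigenvalue 1) left eigenvector and how its entropy is measured. The function names and the toy matrices are assumptions made for illustration.

```python
import math

def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the left eigenvector of P with eigenvalue 1.

    P is a row-stochastic matrix given as a list of rows (illustrative input).
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of pi <- pi P.
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Toy 2-state chain: the policy-induced dynamics favour state 0.
P_biased = [[0.9, 0.1],
            [0.5, 0.5]]
# Toy 3-state doubly stochastic chain: its stationary distribution is uniform.
P_uniform = [[0.5, 0.5, 0.0],
             [0.0, 0.5, 0.5],
             [0.5, 0.0, 0.5]]

pi_biased = stationary_distribution(P_biased)
pi_uniform = stationary_distribution(P_uniform)
```

A policy whose induced chain looks like `P_uniform` covers the states evenly in the long run, so its steady-state entropy reaches the maximum `log(3)`; the biased chain scores lower.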
Beyond Motion Imitation: Is Human Motion Data Alone Sufficient to Explain Gait Control and Biomechanics?
超越运动模仿:仅凭人体运动数据是否足以解释步态控制和生物力学?
- Authors: Xinyi Liu, Jangwhan Ahn, Edgar Lobaton, Jennie Si, He Huang
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2603.12408
- Pdf link: https://arxiv.org/pdf/2603.12408
- Abstract
With the growing interest in motion imitation learning (IL) for human biomechanics and wearable robotics, this study investigates how additional foot-ground interaction measures, used as reward terms, affect human gait kinematics and kinetics estimation within a reinforcement learning-based IL framework. Results indicate that accurate reproduction of forward kinematics alone does not ensure biomechanically plausible joint kinetics. Adding foot-ground contacts and contact forces to the IL reward terms enables the prediction of joint moments in forward walking simulation, which are significantly closer to those computed by inverse dynamics. This finding highlights a fundamental limitation of motion-only IL approaches, which may prioritize kinematics matching over physical consistency. Incorporating kinetic constraints, particularly ground reaction force and center of pressure information, significantly enhances the realism of internal and external kinetics. These findings suggest that, when imitation learning is applied to human-related research domains such as biomechanics and wearable robot co-design, kinetics-based reward shaping is necessary to achieve physically consistent gait representations.
- 中文摘要
随着运动模仿学习(IL)在人体生物力学和可穿戴机器人领域日益受到关注,本研究探讨了在基于强化学习的 IL 框架中,把额外的足-地交互测量作为奖励项会如何影响人体步态运动学和动力学的估计。结果表明,仅准确复现正向运动学并不能保证关节动力学在生物力学上合理。在 IL 奖励项中加入足-地接触和接触力后,正向行走仿真所预测的关节力矩明显更接近逆动力学的计算结果。这一发现凸显了仅依赖运动数据的 IL 方法的根本局限:它们可能优先匹配运动学而非保证物理一致性。引入动力学约束,特别是地面反作用力和压力中心信息,可显著提升内部与外部动力学的真实性。这些发现表明,当模仿学习应用于生物力学和可穿戴机器人协同设计等与人相关的研究领域时,基于动力学的奖励塑形对于获得物理上一致的步态表示是必要的。
CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning
CALF:面向分布式强化学习的通信感知学习框架
- Authors: Carlos Purves, Pietro Lio'
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12543
- Pdf link: https://arxiv.org/pdf/2603.12543
- Abstract
Distributed reinforcement learning policies face network delays, jitter, and packet loss when deployed across edge devices and cloud servers. Standard RL training assumes zero-latency interaction, causing severe performance degradation under realistic network conditions. We introduce CALF (Communication-Aware Learning Framework), which trains policies under realistic network models during simulation. Systematic experiments demonstrate that network-aware training substantially reduces deployment performance gaps compared to network-agnostic baselines. Distributed policy deployments across heterogeneous hardware validate that explicitly modelling communication constraints during training enables robust real-world execution. These findings establish network conditions as a major axis of sim-to-real transfer for Wi-Fi-like distributed deployments, complementing physics and visual domain randomisation.
- 中文摘要
分布式强化学习策略部署在边缘设备和云服务器上时,会面临网络延迟、抖动和丢包。标准的强化学习训练假设交互零延迟,导致在真实网络条件下性能严重下降。我们提出 CALF(Communication-Aware Learning Framework,通信感知学习框架),它在仿真阶段就在真实的网络模型下训练策略。系统性实验表明,与不考虑网络的基线相比,网络感知训练显著缩小了部署时的性能差距。在异构硬件上的分布式策略部署验证了:在训练期间显式建模通信约束能够实现稳健的真实世界执行。这些发现表明,对于类 Wi-Fi 的分布式部署,网络条件是仿真到现实(sim-to-real)迁移的一个主要维度,与物理和视觉的域随机化互为补充。
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
基于熵引导步骤选择和逐步优势的扩散大型语言模型强化学习
- Authors: Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2603.12554
- Pdf link: https://arxiv.org/pdf/2603.12554
- Abstract
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at this https URL.
- 中文摘要
强化学习(RL)在自回归(AR)语言模型的后训练中已被证明有效,但由于序列级似然难以计算,将这些方法推广到扩散语言模型(DLM)颇具挑战。因此,现有方法依赖替代似然或启发式近似,这可能引入偏差并掩盖去噪过程的序列结构。我们将基于扩散的序列生成表述为去噪轨迹上的有限时域马尔可夫决策过程,并推导出一个精确且无偏的策略梯度,该梯度按去噪步骤分解,并以中间优势表示,无需显式计算序列似然。为了获得实用且计算高效的估计器,我们(i)通过熵引导的近似界选择用于策略更新的去噪步骤,(ii)利用扩散模型自然提供的单步去噪奖励来估计中间优势,避免代价高昂的多步 rollout。在编程和逻辑推理基准上的实验取得了最先进的结果,在数学推理上也有很强的竞争力,优于现有面向 DLM 的强化学习后训练方法。代码可在此 https 网址获取。
A Spectral Revisit of the Distributional Bellman Operator under the Cramér Metric
Cramér 度量下分布贝尔曼算子的谱视角再探
- Authors: Keru Wang, Yixin Deng, Yao Lyu, Stephen Redmond, Shengbo Eben Li
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2603.12576
- Pdf link: https://arxiv.org/pdf/2603.12576
- Abstract
Distributional reinforcement learning (DRL) studies the evolution of full return distributions under Bellman updates rather than focusing on expected values. A classical result is that the distributional Bellman operator is contractive under the Cramér metric, which corresponds to an $L^2$ geometry on differences of cumulative distribution functions (CDFs). While this contraction ensures stability of policy evaluation, existing analyses remain largely metric, focusing on contraction properties without elucidating the structural action of the Bellman update on distributions. In this work, we analyse distributional Bellman dynamics directly at the level of CDFs, treating the Cramér geometry as the intrinsic analytical setting. At this level, the Bellman update acts affinely on CDFs and linearly on differences between CDFs, and its contraction property yields a uniform bound on this linear action. Building on this intrinsic formulation, we construct a family of regularised spectral Hilbert representations that realise the CDF-level geometry by exact conjugation, without modifying the underlying Bellman dynamics. The regularisation affects only the geometry and vanishes in the zero-regularisation limit, recovering the native Cramér metric. This framework clarifies the operator structure underlying distributional Bellman updates and provides a foundation for further functional and operator-theoretic analyses in DRL.
- 中文摘要
分布强化学习(DRL)研究完整回报分布在贝尔曼更新下的演化,而不只关注期望值。一个经典结果是,分布贝尔曼算子在 Cramér 度量下是压缩的,而该度量对应于累积分布函数(CDF)之差上的 $L^2$ 几何。虽然这种压缩性保证了策略评估的稳定性,但现有分析大多停留在度量层面,侧重于压缩性质,而没有阐明贝尔曼更新作用于分布时的结构。在本研究中,我们直接在 CDF 层面分析分布贝尔曼动力学,将 Cramér 几何视为内在的分析框架。在这一层面上,贝尔曼更新对 CDF 的作用是仿射的,对 CDF 之差的作用是线性的,其压缩性质给出了该线性作用的一致界。基于这一内在表述,我们构造了一族正则化的谱希尔伯特表示,它们通过精确共轭实现 CDF 层面的几何,而不改变底层的贝尔曼动力学。正则化只影响几何,并在零正则化极限下消失,从而恢复原生的 Cramér 度量。该框架阐明了分布贝尔曼更新背后的算子结构,并为 DRL 中进一步的泛函分析和算子理论分析奠定了基础。
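The $L^2$-on-CDFs geometry mentioned in the abstract above can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the paper's construction: it computes the Cramér distance between two discrete return distributions and applies a single deterministic-reward update $z \mapsto r + \gamma z$ to both. The helper names, the atoms, and the values of `reward` and `gamma` are all hypothetical.

```python
import math

def cramer_distance(x, p, y, q):
    """L2 distance between the CDFs of two discrete distributions.

    x, y are atom locations; p, q are the matching probabilities.
    """
    grid = sorted(set(x) | set(y))
    total = 0.0
    # Both CDFs are constant on each interval [left, right) of the merged grid.
    for left, right in zip(grid, grid[1:]):
        F = sum(pi for xi, pi in zip(x, p) if xi <= left)
        G = sum(qi for yi, qi in zip(y, q) if yi <= left)
        total += (F - G) ** 2 * (right - left)
    return math.sqrt(total)

def bellman_push(atoms, reward, gamma):
    """Shift-and-scale the atoms, as a deterministic-reward update does."""
    return [reward + gamma * z for z in atoms]

x, p = [0.0, 1.0], [0.5, 0.5]
y, q = [0.0, 2.0], [0.5, 0.5]
gamma, reward = 0.81, 1.0

d_before = cramer_distance(x, p, y, q)
d_after = cramer_distance(bellman_push(x, reward, gamma), p,
                          bellman_push(y, reward, gamma), q)
```

In this toy case the shift by `reward` leaves the distance unchanged and the scaling by `gamma` shrinks it by a factor of `sqrt(gamma)`, which is the kind of uniform bound on CDF differences that the abstract refers to.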
Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
交换引导偏好学习,实现基于人类反馈的个性化强化学习
- Authors: Gihoon Kim, Euntai Kim
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12595
- Pdf link: https://arxiv.org/pdf/2603.12595
- Abstract
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single-reward model. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction. Our code and data are available at this https URL
- 中文摘要
基于人类反馈的强化学习(RLHF)是一种广泛使用的方法,用于使大规模人工智能系统与人类价值观对齐。然而,RLHF 通常假设存在单一的通用奖励,忽视了多样化的偏好并限制了个性化。变分偏好学习(VPL)试图通过引入用户特定的潜变量来解决这一问题。尽管前景可期,我们发现 VPL 存在后验坍缩问题。这一现象在 VAE 中广为人知,但此前尚未在偏好学习框架中被指出。在偏好数据稀疏且解码器表达能力过强的情况下,VPL 可能导致潜变量被忽略,从而退化为单一奖励模型。为克服这一局限,我们提出交换引导偏好学习(SPL)。其核心思想是构造虚拟的交换标注者,并利用其偏好的镜像性质来引导编码器。SPL 引入三个组成部分:(1)交换引导的基础正则化,(2)偏好逆自回归流(P-IAF),以及(3)自适应潜变量条件化。实验表明,SPL 能缓解坍缩,丰富用户特定的潜变量,并提升偏好预测效果。我们的代码和数据可在此 https URL 获取
FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control
FastDSAC:释放最大熵强化学习在高维人形机器人控制中的潜力
- Authors: Jun Xue, Junze Wang, Xinming Zhang, Shanze Wang, Yanjun Chen, Wei Zhang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12612
- Pdf link: https://arxiv.org/pdf/2603.12612
- Abstract
Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a formidable challenge, as the "curse of dimensionality" induces severe exploration inefficiency and training instability in expansive action spaces. Consequently, recent high-throughput paradigms have largely converged on deterministic policy gradients combined with massive parallel simulation. We challenge this compromise with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget and enforce diversity, alongside a continuous distributional critic tailored to ensure value fidelity and mitigate high-dimensional value overestimation. Extensive evaluations on HumanoidBench and other continuous control tasks demonstrate that rigorously designed stochastic policies can consistently match or outperform deterministic baselines, achieving notable gains of 180% and 400% on the challenging Basketball and Balance Hard tasks.
- 中文摘要
将最大熵强化学习(RL)扩展到高维人形机器人控制仍是一项艰巨挑战,因为“维度灾难”会在庞大的动作空间中导致严重的探索低效和训练不稳定。因此,近期的高吞吐量范式大多趋向于将确定性策略梯度与大规模并行仿真相结合。我们用 FastDSAC 挑战这一折中方案,该框架有效释放了最大熵随机策略在复杂连续控制中的潜力。我们引入逐维熵调制(DEM)来动态重新分配探索预算并强制多样性,同时采用连续的分布型评论家(distributional critic)来保证价值估计的保真度并缓解高维下的价值高估。在 HumanoidBench 及其他连续控制任务上的大量评估表明,经过严谨设计的随机策略能够稳定地匹配或超越确定性基线,在具有挑战性的 Basketball 和 Balance Hard 任务上分别取得了 180% 和 400% 的显著提升。
Collaborative Multi-Agent Optimization for Personalized Memory System
个性化记忆系统的协作多智能体优化
- Authors: Wenyu Mao, Haoyang Liu, Zhao Liu, Haosong Tan, Yaorui Shi, Jiancan Wu, An Zhang, Xiang Wang
- Subjects: Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2603.12631
- Pdf link: https://arxiv.org/pdf/2603.12631
- Abstract
Memory systems are crucial to personalized LLMs by mitigating the context window limitation in capturing long-term user-LLM conversations. Typically, such systems leverage multiple agents to handle multi-granular memory construction and personalized memory retrieval tasks. To optimize the system, existing methods focus on specializing agents on their local tasks independently via prompt engineering or fine-tuning. However, they overlook cross-agent collaboration, where independent optimization on local agents hardly guarantees the global system performance. To address this issue, we propose a Collaborative Reinforcement Learning Framework for Multi-Agent Memory Systems (CoMAM), jointly optimizing local agents to facilitate collaboration. Specifically, we regularize agents' execution as a sequential Markov decision process (MDP) to embed inter-agent dependencies into the state transition, yielding both local task rewards (e.g., information coverage for memory construction) and global rewards (i.e., query-answer accuracy). Then, we quantify each agent's contribution via group-level ranking consistency between local and global rewards, treating them as adaptive weights to assign global credit and integrate local-global rewards. Each agent is optimized by these integrated rewards, aligning local improvements with the global performance. Experiments show CoMAM outperforms leading memory systems, validating the efficacy of our proposed collaborative reinforcement learning for joint optimization.
- 中文摘要
记忆系统缓解了上下文窗口在捕捉长期用户-LLM 对话时的限制,因此对个性化 LLM 至关重要。通常,这类系统利用多个智能体来处理多粒度记忆构建和个性化记忆检索任务。为了优化系统,现有方法侧重于通过提示工程或微调,让各智能体在各自的局部任务上独立地专门化。然而,它们忽视了跨智能体协作:对局部智能体的独立优化难以保证系统的全局性能。为解决这一问题,我们提出面向多智能体记忆系统的协作强化学习框架(CoMAM),联合优化各局部智能体以促进协作。具体来说,我们将智能体的执行规整为一个顺序马尔可夫决策过程(MDP),把智能体间的依赖嵌入状态转移中,从而同时产生局部任务奖励(例如记忆构建的信息覆盖率)和全局奖励(即问答准确率)。然后,我们通过局部奖励与全局奖励之间的组级排序一致性来量化每个智能体的贡献,并将其作为自适应权重来分配全局信用、融合局部与全局奖励。每个智能体都用这些融合后的奖励进行优化,使局部改进与全局性能保持一致。实验表明,CoMAM 优于领先的记忆系统,验证了我们提出的协作强化学习在联合优化中的有效性。
RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction
RetroReasoner:用于策略性逆合成预测的推理大型语言模型
- Authors: Hanbum Ko, Chanhui Lee, Ye Rin Kim, Rodrigo Hormazabal, Sehui Han, Sungbin Lim, Sungwoong Kim
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12666
- Pdf link: https://arxiv.org/pdf/2603.12666
- Abstract
Retrosynthesis prediction is a core task in organic synthesis that aims to predict reactants for a given product molecule. Traditionally, chemists select a plausible bond disconnection and derive corresponding reactants, which is time-consuming and requires substantial expertise. While recent advancements in molecular large language models (LLMs) have made progress, many methods either predict reactants without strategic reasoning or conduct only a generic product analysis, rather than reason explicitly about bond-disconnection strategies that logically lead to the choice of specific reactants. To overcome these limitations, we propose RetroReasoner, a retrosynthetic reasoning model that leverages chemists' strategic thinking. RetroReasoner is trained using both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we introduce SyntheticRetro, a framework that generates structured disconnection rationales alongside reactant predictions. In the case of RL, we apply a round-trip accuracy as reward, where predicted reactants are passed through a forward synthesis model, and predictions are rewarded when the forward-predicted product matches the original input product. Experimental results show that RetroReasoner not only outperforms prior baselines but also generates a broader range of feasible reactant proposals, particularly in handling more challenging reaction instances.
- 中文摘要
逆合成预测是有机合成中的核心任务,旨在预测特定产物分子的反应物。传统上,化学家选择一个合理的键断开并推导出相应的反应物,这既耗时又需要丰富的专业知识。尽管分子大型语言模型(LLMs)的最新进展有所进展,许多方法要么在没有战略推理的情况下预测反应物,要么仅进行通用产物分析,而非明确推理逻辑上导致特定反应物选择的键断开策略。为克服这些局限,我们提出了RetroReasoner,一种利用化学家战略思维的逆合成推理模型。RetroReasoner 通过监督微调(SFT)和强化学习(RL)进行训练。对于SFT,我们引入了SyntheticRetro,这是一个与反应物预测并生成结构化断开理据的框架。在强化学习中,我们以往返准确率作为奖励,预测反应物通过正向合成模型,当预测产物与原始输入产物匹配时,预测结果获得奖励。实验结果显示,RetroReasoner 不仅优于以往基线,还能生成更广泛的可行反应物提案,尤其是在处理更具挑战性的反应实例时。
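The round-trip accuracy reward described in the abstract above can be sketched as a simple binary check. Everything here is a hypothetical stand-in: `toy_forward_model` is a lookup table, not a trained forward synthesis model, and `canonicalize` only strips whitespace, whereas a real pipeline would presumably canonicalize SMILES strings with a chemistry toolkit.

```python
from typing import Callable, Sequence

def round_trip_reward(
    product: str,
    reactants: Sequence[str],
    forward_model: Callable[[Sequence[str]], str],
    canonicalize: Callable[[str], str] = str.strip,
) -> float:
    """Return 1.0 if the forward model maps the reactants back to the product."""
    predicted = forward_model(reactants)
    return 1.0 if canonicalize(predicted) == canonicalize(product) else 0.0

# Toy forward "model": acetic acid + ethanol -> ethyl acetate (illustrative only).
TOY_FORWARD = {frozenset({"CC(=O)O", "CCO"}): "CCOC(C)=O"}

def toy_forward_model(reactants: Sequence[str]) -> str:
    # Unknown reactant sets map to an empty string, which never matches a product.
    return TOY_FORWARD.get(frozenset(reactants), "")

good = round_trip_reward("CCOC(C)=O", ["CCO", "CC(=O)O"], toy_forward_model)
bad = round_trip_reward("CCOC(C)=O", ["CCO", "CO"], toy_forward_model)
```

A reward of this shape does not require the predicted reactants to match a single reference answer, which fits the abstract's remark about a broader range of feasible reactant proposals.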
EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning
EvolveCoder:通过对抗性验证演进测试用例以实现代码强化学习
- Authors: Chi Ruan, Dongfu Jiang, Huaye Zeng, Ping Nie, Wenhu Chen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2603.12698
- Pdf link: https://arxiv.org/pdf/2603.12698
- Abstract
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.
- 中文摘要
带可验证奖励的强化学习(RLVR)是一种有前景的方法,能够提升大型语言模型中的代码生成,但其有效性受限于现有编码强化学习数据集中验证信号较弱且静态。本文提出了一种基于解条件和对抗性验证框架的框架,基于候选解的执行行为迭代优化测试用例,目标是提高难度、提升判别能力并减少冗余。基于该框架,我们介绍了EvolveCoder-22k,一个通过多轮对抗性测试案例演化构建的大规模编码强化学习数据集。实证分析显示,迭代细化显著增强了验证,pass@1从43.80降至31.22。EvolveCoder-22k 上的强化学习实现了稳定的优化和持续的性能提升,在四个下游基准测试中平均提升了 Qwen3-4B 4.2 分,并优于强劲的 4B 尺度基线。我们的结果凸显了对抗性、解条件验证对于代码生成中有效且可扩展的强化学习的重要性。
Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing
Think and Answer ME:遥感多实体推理定位的基准测试与探索
- Authors: Shuchang Lyu, Haiquan Wen, Guangliang Cheng, Meng Li, Zheng Zhou, You Zhou, Dingding Yao, Zhenwei Shi
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2603.12788
- Pdf link: https://arxiv.org/pdf/2603.12788
- Abstract
Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to the remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at this https URL.
- 中文摘要
推理语言模型和带可验证奖励的强化学习的最新进展显著提升了多步推理能力。这一进展促使人们将推理范式扩展到遥感视觉定位任务。然而,现有的遥感定位方法仍主要局限于感知层面的匹配和单实体表述,限制了显式推理和实体间建模的作用。为应对这一挑战,我们提出了一个新的遥感多实体推理定位基准数据集(ME-RSRG)。基于 ME-RSRG,我们将遥感定位重新表述为多实体推理任务,并提出了建立在视觉-语言基础模型之上的实体感知推理(EAR)框架。EAR 生成结构化的推理轨迹以及主体-客体定位输出。它采用监督微调进行冷启动初始化,并通过实体感知奖励驱动的组相对策略优化(GRPO)进一步优化。在 ME-RSRG 上的大量实验展示了多实体推理的挑战性,并验证了所提 EAR 框架的有效性。我们的数据集、代码和模型将在此 https URL 上提供。
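The abstract above relies on Group Relative Policy Optimization (GRPO). As a rough sketch of the commonly described group-relative advantage, assuming the standard mean-and-standard-deviation normalization rather than the paper's entity-aware reward, each sampled response is scored against the other responses drawn for the same prompt. The function name, the `eps` value, and the example rewards are illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    No learned value network is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses for one prompt, with binary (verifiable) rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# If every sample receives the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses above the group mean receive positive advantages and the rest negative ones, so the update direction depends only on relative quality within the group.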
FLUX: Accelerating Cross-Embodiment Generative Navigation Policies via Rectified Flow and Static-to-Dynamic Learning
FLUX:通过整流流(Rectified Flow)和静态到动态学习加速跨具身生成式导航策略
- Authors: Zeying Gong, Yangyi Zhong, Yiyi Ding, Tianshuai Hu, Guoyang Zhao, Lingdong Kong, Rong Li, Jiadi You, Junwei Liang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2603.12806
- Pdf link: https://arxiv.org/pdf/2603.12806
- Abstract
Autonomous navigation requires a broad spectrum of skills, from static goal-reaching to dynamic social traversal, yet evaluation remains fragmented across disparate protocols. We introduce DynBench, a dynamic navigation benchmark featuring physically valid crowd simulation. Combined with existing static protocols, it supports comprehensive evaluation across six fundamental navigation tasks. Within this framework, we propose FLUX, the first flow-based unified navigation policy. By linearizing probability flow, FLUX replaces iterative denoising with straight-line trajectories, improving per-step inference efficiency by 47% over prior flow-based methods and 29% over diffusion-based ones. Following a static-to-dynamic curriculum, FLUX initially establishes geometric priors and is subsequently refined through reinforcement learning in dynamic social environments. This regime not only strengthens socially-aware navigation but also enhances static task robustness by capturing recovery behaviors through stochastic action distributions. FLUX achieves state-of-the-art performance across all tasks and demonstrates zero-shot sim-to-real transfer on wheeled, quadrupedal, and humanoid platforms without any fine-tuning.
- 中文摘要
自主导航需要广泛的技能,从静态的目标到达到动态的社交场景穿行,但评估仍分散在互不相同的协议之中。我们提出 DynBench,一个具有物理上有效的人群仿真的动态导航基准。结合现有的静态协议,它支持对六项基本导航任务的全面评估。在此框架下,我们提出 FLUX,首个基于流的统一导航策略。通过线性化概率流,FLUX 用直线轨迹取代迭代去噪,每步推理效率比以往基于流的方法提升 47%,比基于扩散的方法提升 29%。遵循静态到动态的课程,FLUX 先建立几何先验,随后在动态社交环境中通过强化学习进一步优化。这一训练方式不仅增强了具有社会意识的导航能力,还通过随机动作分布捕捉恢复行为,提升了静态任务的鲁棒性。FLUX 在所有任务上都取得了最先进的性能,并在轮式、四足和人形平台上实现了无需任何微调的零样本仿真到现实(sim-to-real)迁移。
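The "straight-line trajectories" in the abstract above refer to the general rectified-flow idea. The sketch below illustrates that idea only and says nothing about FLUX's actual network or training loop: samples are interpolated linearly between a noise point `x0` and a data point `x1`, the regression target for the velocity is the constant `x1 - x0`, and a perfectly straight flow can be integrated in a single Euler step. All names and the time convention (noise at `t = 0`, data at `t = 1`) are assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t = 0) to x1 (t = 1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise, action = [0.0, 0.0], [2.0, -1.0]
v_star = target_velocity(noise, action)

# With an exactly straight flow, one step already lands on the target.
one_step = euler_integrate(noise, lambda x, t: v_star, steps=1)
ten_steps = euler_integrate(noise, lambda x, t: v_star, steps=10)
```

Because the ideal velocity field is constant along each path, extra integration steps add cost without changing the endpoint, which is the intuition behind the per-step efficiency gains the abstract reports.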
Reinforcement Learning for Elliptical Cylinder Motion Control Tasks
椭圆圆柱体运动控制任务的强化学习
- Authors: Pawel Marczewski, Paulina Superczynska, Jakub Bernat, Szymon Szczesny
- Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2603.12807
- Pdf link: https://arxiv.org/pdf/2603.12807
- Abstract
Controlling devices with limited input has long attracted research attention because the problem is difficult and its solutions are non-trivial. For instance, the inverted pendulum is a benchmark problem in control theory and machine learning. In this work, we focus on the elliptical cylinder and its motion under limited torque. The problem is inspired by untethered magnetic devices, which, due to distance, have to operate with limited input torque. The main goal is to define the control problem of an elliptical cylinder with limited input torque and to solve it with reinforcement learning. As a classical baseline, we evaluate a two-stage controller composed of an energy-shaping swing-up law and a local Linear Quadratic Regulator (LQR) stabilizer around the target equilibrium. The swing-up controller increases the system's mechanical energy to drive the state toward a neighborhood of the desired equilibrium; a linearization of the nonlinear model then yields an LQR that regulates the angle and angular-rate states to the target orientation with bounded input. This swing-up + LQR policy is a strong, interpretable reference for underactuated systems and serves as a point of comparison for the learned policy under identical limits and parameters. The results show that learning is possible; however, cases such as stabilization in the upward position or rotation by half a turn become very difficult as the mass increases or for ellipses with a strongly unequal perimeter ratio.
- 中文摘要
输入受限设备的控制问题因其难度大、解法非平凡而一直受到研究者关注。例如,倒立摆是控制理论和机器学习中的基准问题。本研究关注椭圆柱体及其在有限力矩下的运动。该问题的灵感来自无系绳磁性器件,由于距离的原因,这些器件必须在有限的输入力矩下工作。本研究的主要目标是定义输入力矩受限的椭圆柱体控制问题,并用强化学习求解。作为经典基线,我们评估了一个两级控制器,它由能量整形起摆(swing-up)控制律和目标平衡点附近的局部线性二次调节器(LQR)稳定器组成。起摆控制器提高系统的机械能,将状态驱动到期望平衡点的邻域;对非线性模型线性化后得到的 LQR 在输入有界的条件下把角度和角速度状态调节到目标姿态。这种起摆 + LQR 策略是欠驱动系统的一个强有力且可解释的参考,并在相同的限制和参数下作为与学习到的策略的对比基准。结果表明学习是可行的,但随着质量增大,或对于周长比严重不等的椭圆,诸如在竖直向上位置稳定或旋转半圈等情形会变得非常困难。
A Multi-task Large Reasoning Model for Molecular Science
分子科学的多任务大推理模型
- Authors: Pengfei Liu, Shuang Ge, Jun Tao, Zhixiang Ren
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2603.12808
- Pdf link: https://arxiv.org/pdf/2603.12808
- Abstract
Advancements in artificial intelligence for molecular science are necessitating a paradigm shift from purely data-driven predictions to knowledge-guided computational reasoning. Existing molecular models are predominantly proprietary, lacking general molecular intelligence and generalizability. This underscores the necessity for computational methods that can effectively integrate scientific logic with deep learning architectures. Here we introduce a multi-task large reasoning model designed to emulate the cognitive processes of molecular scientists through structured reasoning and reflection. Our approach incorporates multi-specialist modules to provide versatile molecular expertise and a chain-of-thought (CoT) framework enhanced by reinforcement learning infused with molecular knowledge, enabling structured and reflective reasoning. Systematic evaluations across 10 molecular tasks and 47 metrics demonstrate that our model achieves an average 50.3% improvement over the base architecture, outperforming over 20 state-of-the-art baselines, including ultra-large-parameter foundation models, despite using significantly fewer training data and computational resources. This validates that embedding explicit reasoning mechanisms enables high-efficiency learning, allowing smaller-scale models to surpass massive counterparts in both efficacy and interpretability. The practical utility of this computational framework was validated through a case study on the design of central nervous system (CNS) drug candidates, illustrating its capacity to bridge data-driven and knowledge-integrated approaches for intelligent molecular design.
- 中文摘要
分子科学人工智能的进步迫使从纯数据驱动预测转向知识引导的计算推理。现有的分子模型主要是专有的,缺乏通用的分子智能和推广性。这凸显了能够有效将科学逻辑与深度学习架构整合的计算方法的必要性。这里我们介绍了一个多任务大型推理模型,旨在通过结构化推理和反思模拟分子科学家的认知过程。我们的方法融合了多专业模块,提供多样化的分子专业知识,以及通过注入分子知识的强化学习增强的思维链(Chain-of-Thought,CoT)框架,实现结构化和反思性推理。系统评估涵盖10个分子任务和47项指标,显示我们的模型平均比基础架构提升50.3%,优于20多个最先进基线,包括超大参数基础模型,尽管使用了显著较少的训练数据和计算资源。这验证了嵌入显式推理机制能够实现高效学习,使小规模模型在效能和解释性上超越庞大模型。通过中枢神经系统(CNS)药物候选方案设计的案例研究,验证了该计算框架的实用性,展示了其在智能分子设计中连接数据驱动与知识整合方法的能力。
Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design
重新思考 RLVR 中的多项选择题:通过干扰项设计释放潜力
- Authors: Xu Guo, Qiming Ge, Jian Tong, Kedi Chen, Jin Zhang, Xiaogui Yang, Xuan Gao, Haijun Lv, Zhihui Lu, Yicheng Zou, Qipeng Guo
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2603.12826
- Pdf link: https://arxiv.org/pdf/2603.12826
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.
- 中文摘要
带可验证奖励的强化学习(RLVR)显著增强了大型语言模型的推理能力。应用于RLVR时,选择题(MCQ)提供了可扩展的可验证数据来源,但也存在诱发奖励黑客行为的风险,即模型通过随机猜测或简单排除法走捷径、绕过推理。目前的方法通常通过将选择题转换为开放式格式来缓解这一问题,却因此丢弃了专家设计的干扰项所提供的对比信号。本研究系统地探讨了选项设计对RLVR的影响。我们的分析强调了两个主要见解:(1)训练和测试之间选项数量的不匹配会降低表现。(2)强干扰项能有效抑制随机猜测,即使只用二选一问题也能有效进行RLVR训练。基于这些发现,我们提出了迭代干扰项策划(Iterative Distractor Curation,IDC)框架,该框架主动构建高质量的干扰项,以阻断排除法捷径并促进深度推理。在多个基准上的实验表明,我们的方法有效提升了干扰项质量,并且相比原始数据在RLVR训练中取得了显著提升。
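As a rough illustration of why distractor strength matters for the guessing shortcut described in this abstract, the sketch below computes the expected binary reward of a uniform random guess. It is a toy model, not the paper's IDC method: the function name `guess_success_probability` and the assumption that weak distractors can be eliminated with certainty are hypothetical simplifications.

```python
def guess_success_probability(num_options: int, num_eliminated: int = 0) -> float:
    """Expected reward (1 for correct, 0 otherwise) of a uniform random guess.

    num_options:    total answer choices in the question.
    num_eliminated: wrong choices the model can rule out without reasoning
                    (assumed here to be eliminated with certainty).
    """
    remaining = num_options - num_eliminated
    if remaining < 1:
        raise ValueError("at least the correct option must remain")
    return 1.0 / remaining


# A 4-way question with two weak distractors is as guessable as a 2-way one.
four_way_strong = guess_success_probability(4)
four_way_weak = guess_success_probability(4, num_eliminated=2)
two_way_strong = guess_success_probability(2)
```

Under this toy model, the number of options matters less than how many of them survive elimination, which is in line with the abstract's second finding about strong distractors on 2-way questions.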
Beyond Imitation: Reinforcement Learning Fine-Tuning for Adaptive Diffusion Navigation Policies
超越模仿:面向自适应扩散导航策略的强化学习微调
- Authors: Junhe Sheng, Ruofei Bai, Kuan Xu, Ruimeng Liu, Jie Chen, Shenghai Yuan, Wei-Yun Yau, Lihua Xie
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2603.12868
- Pdf link: https://arxiv.org/pdf/2603.12868
- Abstract
Diffusion-based robot navigation policies trained on large-scale imitation learning datasets can generate multi-modal trajectories directly from the robot's visual observations, bypassing the traditional localization-mapping-planning pipeline and achieving strong zero-shot generalization. However, their performance remains constrained by the coverage of offline datasets, and when deployed in unseen settings, distribution shift often leads to accumulated trajectory errors and safety-critical failures. Adapting diffusion policies with reinforcement learning is challenging because their iterative denoising structure hinders effective gradient backpropagation, while also making the training of an additional value network computationally expensive and less stable. To address these issues, we propose a reinforcement learning fine-tuning framework tailored for diffusion-based navigation. The method leverages the inherent multi-trajectory sampling mechanism of diffusion models and adopts Group Relative Policy Optimization (GRPO), which estimates relative advantages across sampled trajectories without requiring a separate value network. To preserve pretrained representations while enabling adaptation, we freeze the visual encoder and selectively update the higher decoder layers and action head, enhancing safety-aware behaviors through online environmental feedback. On the PointGoal task in Isaac Sim, our approach improves the Success Rate from 52.0% to 58.7% and SPL from 0.49 to 0.54 on unseen scenes, while reducing collision frequency. Additional experiments show that the fine-tuned policy transfers zero-shot to a real quadruped platform and maintains stable performance in geometrically out-of-distribution environments, suggesting improved adaptability and safe generalization to new domains.
- 中文摘要
基于扩散的机器人导航策略在大规模模仿学习数据集上训练,能够直接从机器人的视觉观测中生成多模态轨迹,绕过传统的定位-建图-规划流程,实现较强的零样本泛化。然而,其性能仍受限于离线数据集的覆盖范围,且在未见过的环境中部署时,分布偏移常常导致轨迹误差累积和安全攸关的故障。用强化学习来适配扩散策略具有挑战性,因为它们的迭代去噪结构阻碍了有效的梯度反向传播,同时使得额外价值网络的训练计算成本高且稳定性较低。为解决这些问题,我们提出了一个专为基于扩散的导航设计的强化学习微调框架。该方法利用扩散模型固有的多轨迹采样机制,并采用组相对策略优化(Group Relative Policy Optimization,GRPO),该方法无需单独的价值网络即可估计各采样轨迹之间的相对优势。为了在实现适应的同时保留预训练表征,我们冻结视觉编码器,并选择性地更新较高的解码器层和动作头,通过在线环境反馈增强安全感知行为。在Isaac Sim的PointGoal任务中,我们的方法将未见场景的成功率从52.0%提升到58.7%,将SPL从0.49提升到0.54,同时降低了碰撞频率。更多实验显示,微调后的策略可以零样本迁移到真实的四足平台,并在几何上分布外的环境中保持稳定性能,表明其适应性得到提升,并能安全地泛化到新领域。
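The value-network-free advantage estimate mentioned in this abstract can be sketched as follows. This is the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); whether the paper uses exactly this form is an assumption, and the function name `group_relative_advantages` and the `eps` constant are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled trajectory against its own sampling group.

    rewards: scalar returns of trajectories sampled for the same observation.
    eps:     small constant (assumed) that avoids division by zero when all
             rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for one observation: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, trajectories that beat their siblings get positive advantages and the rest get negative ones, with no learned critic involved.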
Test-time RL alignment exposes task familiarity artifacts in LLM benchmarks
测试时强化学习对齐暴露了大型语言模型基准测试中的任务熟悉度伪影
- Authors: Kun Wang, Reinhard Heckel
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2603.12875
- Pdf link: https://arxiv.org/pdf/2603.12875
- Abstract
Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised finetuning. However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) alignment method for train-before-test. First, RL with a single sample provides a first alignment of the model to the task format, and second, test-time RL with majority-voting reward aligns the model to the benchmark distribution. Our test-time RL alignment method aligns models about as well as SFT-based train-before-test, but without requiring a task-specific training set. On a domain-specific benchmark without training data, we show that direct evaluation underestimates base models, which perform substantially better once aligned, yielding a more faithful evaluation of their capabilities. Moreover, for reasoning tasks, the performance gap between fine-tuned models and their base models largely disappears after alignment, suggesting that many gains from RLVR/SFT reported in the literature are not a difference in reasoning capability, but rather artifacts of task familiarity.
- 中文摘要
直接在基准测试上评估大型语言模型可能具有误导性,因为相对较强的性能可能反映的是任务熟悉度,而非能力。先训练后测试(train-before-test)方法通过在评估前为每个模型提供与任务相关的训练来控制任务熟悉度,最初是通过监督微调实现的。然而,合适的训练数据往往难以获得,评估结果也会因所选数据而异。本文提出了一种用于先训练后测试的两阶段测试时强化学习(RL)对齐方法。首先,使用单个样本的强化学习将模型初步对齐到任务格式;其次,带有多数投票奖励的测试时强化学习使模型与基准分布对齐。我们的测试时强化学习对齐方法的对齐效果与基于SFT的先训练后测试相当,但不需要针对特定任务的训练集。在一个没有训练数据的领域特定基准上,我们表明直接评估会低估基础模型,这些模型在对齐后表现显著更好,从而对其能力给出了更忠实的评估。此外,对于推理任务,微调模型与其基础模型之间的性能差距在对齐后基本消失,表明文献中报道的RLVR/SFT带来的许多提升并非推理能力的差异,而是任务熟悉度造成的假象。
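A majority-voting reward of the kind named in this abstract can be sketched as below: the most frequent answer among the samples for one prompt acts as a pseudo-label, so no ground-truth label is needed. The function name `majority_vote_rewards`, the binary reward values, and the exact-string comparison of answers are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward each sampled answer by agreement with the group's majority answer.

    answers: final answers extracted from several samples for the same prompt.
    Returns 1.0 for samples that match the most common answer, 0.0 otherwise.
    """
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


# Four samples for one benchmark question; "4" is the majority answer.
rewards = majority_vote_rewards(["4", "4", "5", "4"])
```

Note that this signal rewards self-consistency, not correctness: if the majority answer is wrong, the wrong answer is reinforced.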
Enhanced Drug-drug Interaction Prediction Using Adaptive Knowledge Integration
利用自适应知识集成增强药物-药物相互作用预测
- Authors: Pengfei Liu, Jun Tao, Zhixiang Ren
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2603.12885
- Pdf link: https://arxiv.org/pdf/2603.12885
- Abstract
Drug-drug interaction event (DDIE) prediction is crucial for preventing adverse reactions and ensuring optimal therapeutic outcomes. However, existing methods often face challenges with imbalanced datasets, complex interaction mechanisms, and poor generalization to unknown drug combinations. To address these challenges, we propose a knowledge augmentation framework that adaptively infuses prior drug knowledge into a large language model (LLM). This framework utilizes reinforcement learning techniques to facilitate adaptive knowledge extraction and synthesis, thereby efficiently optimizing the strategy space to enhance the accuracy of LLMs for DDIE predictions. As a result of few-shot learning, we achieved a notable improvement compared to the baseline. This approach establishes an effective framework for scientific knowledge learning for DDIE predictions.
- 中文摘要
药物相互作用事件(DDIE)预测对于预防不良反应和确保最佳治疗效果至关重要。然而,现有方法常面临数据集不平衡、相互作用机制复杂以及对未知药物组合的泛化能力较差等挑战。为应对这些挑战,我们提出了一种知识增强框架,能够自适应地将先验药物知识注入大型语言模型(LLM)。该框架利用强化学习技术促进自适应知识提取与综合,从而高效优化策略空间,提升大型语言模型在DDIE预测中的准确性。借助少样本学习,我们相较基线取得了显著提升。这种方法为DDIE预测的科学知识学习建立了有效的框架。
Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models
文本到图像模型的强化学习后训练有限差分流优化
- Authors: David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2603.12893
- Pdf link: https://arxiv.org/pdf/2603.12893
- Abstract
Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
- 中文摘要
强化学习(RL)已成为对基于扩散的图像合成模型进行后训练的标准技术,因为它能够从奖励信号中学习,明确提升图像质量和提示对齐等理想特性。本文提出了一种在线强化学习变体,通过采样成对轨迹并将流速度拉向更优图像的方向,降低模型更新的方差。与将每个采样步骤视为独立策略动作的现有方法不同,我们将整个采样过程视为单一动作。我们分别以高质量视觉语言模型和现成的质量指标作为奖励进行实验,并使用一组广泛的指标评估输出。与以往方法相比,我们的方法收敛更快,输出质量和提示对齐度也更高。
Thinking in Streaming Video
流媒体视频中的思考
- Authors: Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12938
- Pdf link: https://arxiv.org/pdf/2603.12938
- Abstract
Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch--Think--Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at this https URL
- 中文摘要
实时理解连续视频流对于在动态环境中运行的交互式助手和多模态智能体至关重要。然而,大多数现有视频推理方法采用批处理范式,将推理推迟到观察完整个视频上下文之后,导致高延迟和不断增长的计算成本,与流式场景不兼容。本文介绍了ThinkStream,这是一种基于“观看-思考-说话”(Watch-Think-Speak)范式的流式视频推理框架,使模型能够随着新的视频观测到来逐步更新其理解。在每一步,模型都会进行简短的推理更新,并判断是否已积累足够的证据来给出回应。为支持长时程流式处理,我们提出了推理压缩流式记忆(RCSM),该方法将中间推理轨迹视为紧凑的语义记忆,用以替代过时的视觉token,同时保留关键上下文。我们进一步使用带可验证奖励的流式强化学习方案训练模型,使增量推理和响应时机与流式交互的需求保持一致。在多个流式视频基准上的实验显示,ThinkStream在保持低延迟和低内存占用的同时,显著优于现有在线视频模型。代码、模型和数据将在此 https URL 发布
Efficient Real-World Autonomous Racing via Attenuated Residual Policy Optimization
通过衰减残差策略优化实现高效的现实世界自动驾驶竞速
- Authors: Raphael Trumpp, Denis Hoornaert, Mirco Theile, Marco Caccamo
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.12960
- Pdf link: https://arxiv.org/pdf/2603.12960
- Abstract
Residual policy learning (RPL), in which a learned policy refines a static base policy using deep reinforcement learning (DRL), has shown strong performance across various robotic applications. Its effectiveness is particularly evident in autonomous racing, a domain that serves as a challenging benchmark for real-world DRL. However, deploying RPL-based controllers introduces system complexity and increases inference latency. We address this by introducing an extension of RPL named attenuated residual policy optimization ($\alpha$-RPO). Unlike standard RPL, $\alpha$-RPO yields a standalone neural policy by progressively attenuating the base policy, which initially serves to bootstrap learning. Furthermore, this mechanism enables a form of privileged learning, where the base policy is permitted to use sensor modalities not required for final deployment. We design $\alpha$-RPO to integrate seamlessly with PPO, ensuring that the attenuated influence of the base controller is dynamically compensated during policy optimization. We evaluate $\alpha$-RPO by building a framework for 1:10-scaled autonomous racing around it. In both simulation and zero-shot real-world transfer to Roboracer cars, $\alpha$-RPO not only reduces system complexity but also improves driving performance compared to baselines - demonstrating its practicality for robotic deployment. Our code is available at: this https URL.
- 中文摘要
残差策略学习(RPL)利用深度强化学习(DRL)让学习到的策略对静态基础策略进行修正,在各种机器人应用中表现出优异的性能。其有效性在自主赛车中尤为明显,该领域是现实世界DRL的一个具有挑战性的基准。然而,部署基于RPL的控制器会增加系统复杂性和推理延迟。我们通过引入一个名为衰减残差策略优化($\alpha$-RPO)的RPL扩展来解决这个问题。与标准RPL不同,$\alpha$-RPO通过逐步衰减最初用于引导学习的基础策略,得到一个独立的神经网络策略。此外,该机制还实现了一种特权学习,即允许基础策略使用最终部署中不需要的传感器模态。我们将$\alpha$-RPO设计为可与PPO无缝集成,确保基础控制器被衰减的影响在策略优化过程中得到动态补偿。我们围绕$\alpha$-RPO构建了一个1:10比例自主赛车框架来对其进行评估。无论是在仿真中还是在零样本迁移到真实Roboracer赛车时,$\alpha$-RPO不仅降低了系统复杂度,还相较于基线提升了驾驶性能,展示了其在机器人部署中的实用性。我们的代码可在以下 https URL 获取。
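The idea of attenuating a base policy until only the learned policy remains can be sketched as below. The abstract does not give the combination rule or the schedule, so both are assumptions here: the action is taken as `alpha * base + residual`, `alpha` decays linearly from 1 to 0, and the names `attenuation` and `combined_action` are hypothetical.

```python
def attenuation(step: int, total_steps: int) -> float:
    """Assumed linear schedule: alpha goes from 1.0 at step 0 to 0.0 at total_steps."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - fraction


def combined_action(base_action, residual_action, alpha):
    """Blend the base controller with the learned policy output.

    With alpha = 1 this is standard residual policy learning (base + residual);
    with alpha = 0 the learned policy acts alone, so the base controller is
    no longer needed at deployment.
    """
    return [alpha * b + r for b, r in zip(base_action, residual_action)]


# Early in training the base controller dominates; at the end it is gone.
early = combined_action([1.0, 2.0], [0.5, -0.5], attenuation(0, 100))
late = combined_action([1.0, 2.0], [0.5, -0.5], attenuation(100, 100))
```

In this sketch the learned policy must gradually take over whatever the base controller contributed, which is the compensation effect the abstract says is handled during policy optimization.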
Long-form RewardBench: Evaluating Reward Models for Long-form Generation
Long-form RewardBench:评估面向长文本生成的奖励模型
- Authors: Hui Huang, Yancheng He, Wei Liu, Muyun Yang, Jiaheng Liu, Kehai Chen, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2603.12963
- Pdf link: https://arxiv.org/pdf/2603.12963
- Abstract
The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.
- 中文摘要
基于强化学习的对齐方法被广泛采用,凸显了奖励模型日益重要的地位。已有多种基准被构建出来,用于评估不同领域和场景下的奖励模型。然而,尽管长文本生成在实际应用中扮演关键角色,针对长文本生成的奖励模型评估仍存在显著空白。为了弥补这一空白,我们推出了Long-form RewardBench,这是首个专门为长文本生成设计的奖励建模测试平台。我们的基准涵盖五个关键子任务:问答(QA)、检索增强生成(RAG)、聊天(Chat)、写作(Writing)和推理(Reasoning)。我们通过精心设计的多阶段数据收集流程收集了指令和偏好数据,并在20多个主流奖励模型(包括分类器和生成式模型)上进行了大量实验。我们的发现显示,当前模型仍然缺乏长文本奖励建模能力。此外,我们设计了一种新颖的长文本“大海捞针”测试,揭示了奖励建模表现与错误在回答中的位置以及回答总长度之间的相关性,且分类模型与生成式模型表现出不同的特征。最后,我们证明了在相同数据上训练时,分类器比生成式模型表现出更好的泛化性。作为长文本奖励建模的首个基准,本研究旨在提供一个稳健的平台,以直观呈现这一关键领域的进展。
ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning
ARL-Tangram:释放代理强化学习中的资源效率
- Authors: Bangjun Xiao, Yihao Zhao, Xiangwei Deng, Shihua Yu, Yuxing Xiang, Huaqiu Liu, Qiying Wang, Liang Zhao, Hailin Zhang, Xuanzhe Liu, Xin Jin, Fuli Luo
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2603.13019
- Pdf link: https://arxiv.org/pdf/2603.13019
- Abstract
Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with the real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL frameworks typically rely on static over-provisioning, i.e., resources are often tied to long-lived trajectories or isolated by tasks, which leads to severe resource inefficiency. We propose action-level orchestration and incorporate it into ARL-Tangram, a unified resource management system that enables fine-grained external resource sharing and elasticity. ARL-Tangram utilizes a unified action-level formulation and an elastic scheduling algorithm to minimize action completion time (ACT) while satisfying heterogeneous resource constraints. Further, heterogeneous resource managers are tailored to efficiently support the action-level execution on resources with heterogeneous characteristics and topologies. Evaluation on real-world agentic RL tasks demonstrates that ARL-Tangram improves average ACT by up to 4.3$\times$, speeds up the step duration of RL training by up to 1.5$\times$, and saves the external resources by up to 71.2$\%$. This system has been deployed to support the training of the MiMo series models.
- 中文摘要
代理式强化学习(RL)已成为云集群中的一种变革性工作负载,使大型语言模型(LLM)能够通过与现实世界的交互来解决复杂问题。然而,与传统强化学习不同,代理式强化学习需要大量外部云资源,例如用于代码执行的CPU和用于奖励模型的GPU,这些资源位于主训练集群之外。现有的代理式强化学习框架通常依赖静态的过度配置,即资源常被绑定于长生命周期的轨迹或按任务隔离,导致资源效率极低。我们提出了动作级编排,并将其整合进ARL-Tangram,这是一个统一的资源管理系统,可实现细粒度的外部资源共享和弹性伸缩。ARL-Tangram采用统一的动作级建模和弹性调度算法,在满足异构资源约束的同时最小化动作完成时间(ACT)。此外,我们还定制了异构资源管理器,以便在具有异构特性和拓扑的资源上高效支持动作级执行。在真实的代理式强化学习任务上的评估表明,ARL-Tangram将平均ACT最多改善4.3倍,将强化学习训练的单步耗时最多加速1.5倍,并最多节省71.2%的外部资源。该系统已被部署用于支持MiMo系列模型的训练。
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
PISmith:基于强化学习的提示注入防御红队测试
- Authors: Chenlong Yin, Runpeng Geng, Yanting Wang, Jinyuan Jia
- Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2603.13026
- Pdf link: https://arxiv.org/pdf/2603.13026
- Abstract
Prompt injection poses serious security risks to real-world LLM applications, particularly autonomous agents. Although many defenses have been proposed, their robustness against adaptive attacks remains insufficiently evaluated, potentially creating a false sense of security. In this work, we propose PISmith, a reinforcement learning (RL)-based red-teaming framework that systematically assesses existing prompt-injection defenses by training an attack LLM to optimize injected prompts in a practical black-box setting, where the attacker can only query the defended LLM and observe its outputs. We find that directly applying standard GRPO to attack strong defenses leads to sub-optimal performance due to extreme reward sparsity -- most generated injected prompts are blocked by the defense, causing the policy's entropy to collapse before discovering effective attack strategies, while the rare successes cannot be learned effectively. In response, we introduce adaptive entropy regularization and dynamic advantage weighting to sustain exploration and amplify learning from scarce successes. Extensive evaluation on 13 benchmarks demonstrates that state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks. We also compare PISmith with 7 baselines across static, search-based, and RL-based attack categories, showing that PISmith consistently achieves the highest attack success rates. Furthermore, PISmith achieves strong performance in agentic settings on InjecAgent and AgentDojo against both open-source and closed-source LLMs (e.g., GPT-4o-mini and GPT-5-nano). Our code is available at this https URL.
- 中文摘要
提示注入对现实世界的大型语言模型应用,尤其是自主智能体,构成严重的安全风险。尽管已有许多防御措施被提出,但它们对自适应攻击的稳健性仍未得到充分评估,可能造成虚假的安全感。在本研究中,我们提出了PISmith,这是一个基于强化学习(RL)的红队测试框架,它在实际的黑盒设定下训练一个攻击LLM来优化注入提示,从而系统性地评估现有的提示注入防御;在该设定下,攻击者只能查询被防御的LLM并观察其输出。我们发现,直接将标准GRPO用于攻击强防御会导致次优的性能,原因是奖励极度稀疏:大多数生成的注入提示被防御拦截,导致策略的熵在发现有效攻击策略之前就已崩溃,而罕见的成功样本又无法被有效学习。为此,我们引入了自适应熵正则化和动态优势加权,以维持探索并放大从稀少成功样本中的学习。在13个基准上的广泛评估表明,最先进的提示注入防御仍易受自适应攻击影响。我们还将PISmith与涵盖静态、基于搜索和基于强化学习三类攻击的7个基线进行了比较,结果显示PISmith始终取得最高的攻击成功率。此外,在InjecAgent和AgentDojo的智能体场景中,PISmith对开源和闭源LLM(如GPT-4o-mini和GPT-5-nano)都表现出色。我们的代码可在此 https URL 访问。
Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation
修补漏洞:在多语言翻译强化学习中缓解奖励黑客
- Authors: Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2603.13045
- Pdf link: https://arxiv.org/pdf/2603.13045
- Abstract
Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs by a large margin on 1400 language directions on Flores-101 dataset.
- 中文摘要
大型语言模型(LLMs)在高资源语言对的机器翻译中展现出了卓越的能力,但在低资源翻译上的性能仍然落后。现有的后训练方法高度依赖高质量的平行数据,而这些数据在低资源语言中往往稀缺或不可得。本文介绍了WALAR,一种仅使用单语文本的强化训练方法,旨在提升大型语言模型在大量低资源语言上的翻译能力,同时保持其在高资源语言上的表现。我们的关键见解基于对现有基于源文本的多语言质量估计(QE)模型中失败模式(即“漏洞”)的观察。使用这些QE模型的强化学习(RL)往往会放大这些漏洞,导致多语言LLM表现变差。我们开发了包括词对齐和语言对齐在内的技术,以缓解WALAR的强化学习训练奖励中的这些漏洞。我们使用WALAR持续训练了一个支持101种语言翻译的大型语言模型。实验显示,在Flores-101数据集的1400个语言方向上,我们的新模型大幅优于最强的开源多语言LLM之一LLaMAX。
Topo-R1: Detecting Topological Anomalies via Vision-Language Models
Topo-R1:通过视觉-语言模型检测拓扑异常
- Authors: Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu, Weimin Lyu, Kehan Qi, Dimitris Samaras, Chao Chen
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2603.13054
- Pdf link: https://arxiv.org/pdf/2603.13054
- Abstract
Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.
- 中文摘要
拓扑正确性对于血管、神经纤维和道路网络等管状结构至关重要。现有的拓扑保持方法依赖于特定领域的真值标注,这类标注成本高昂且很少能跨领域迁移。当在没有标注的情况下部署到新领域时,会出现一个关键问题:在没有真值监督的情况下,我们如何检测拓扑异常?我们将此重新表述为拓扑异常检测,这是一项结构化的视觉推理任务,需要模型定位并分类预测分割掩码中的拓扑错误。视觉语言模型(VLMs)是自然的候选对象;然而,我们发现最先进的VLM的表现几乎与随机猜测相当,缺乏识别密集结构中稀疏连通性错误所需的细粒度拓扑感知能力。为弥合这一差距,我们开发了一套自动化数据构建流程,在难度逐级递增的层级上合成带有可验证标注的多样化拓扑异常,从而构建了该任务的首个大规模多领域基准。随后我们提出了Topo-R1框架,该框架通过两阶段训练赋予VLM拓扑感知能力:先进行监督微调,再进行基于组相对策略优化(GRPO)的强化学习。我们方法的核心是一种拓扑感知复合奖励,它整合了用于结构化错误分类的类型感知匈牙利匹配、空间定位评分,以及直接惩罚连通性破坏的中心线Dice(clDice)奖励,从而共同激励语义精度和结构保真度。大量实验表明,Topo-R1为无标注的拓扑质量评估建立了新的范式,在所有评估协议中始终优于通用VLM和监督基线。
Visual-ERM: Reward Modeling for Visual Equivalence
视觉ERM:视觉等效性的奖励建模
- Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2603.13224
- Pdf link: https://arxiv.org/pdf/2603.13224
- Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
- 中文摘要
视觉转代码任务需要模型将结构化的视觉输入(如图表、表格和SVG)重建为具有高视觉保真度的可执行或结构化表示。尽管最新的大型视觉语言模型(LVLM)通过监督微调取得了显著效果,但由于奖励信号错位,强化学习依然充满挑战。现有的奖励要么依赖文本规则,要么依赖粗粒度的视觉嵌入相似度,这两者都无法捕捉细粒度的视觉差异,且容易遭到奖励黑客利用。我们提出了视觉等价奖励模型(Visual-ERM),这是一种多模态生成式奖励模型,提供细粒度、可解释且与任务无关的反馈,直接在渲染后的视觉空间中评估视觉转代码的质量。集成到强化学习中后,Visual-ERM 将 Qwen3-VL-8B-Instruct 在图表转代码上提升了 +8.4,并在表格和 SVG 解析上带来一致的提升(平均 +2.7、+4.1),还通过反思与修订进一步增强了测试时扩展。我们还介绍了VisualCritic-RewardBench(VC-RewardBench),这是一个用于评判结构化视觉数据上细粒度图像间差异的基准,其中8B参数规模的Visual-ERM明显优于Qwen3-VL-235B-Instruct,并接近领先的闭源模型。我们的结果表明,无论任务特异性如何,细粒度的视觉奖励监督对于视觉转代码强化学习既必要又充分。
Keyword: diffusion policy
There is no result