Generated: 2026-05-25 19:54:58 (UTC+8); arXiv release time: 2026-05-25 20:00 EDT (2026-05-26 08:00 UTC+8)
34 related papers today
Keyword: reinforcement learning
FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning
FuRA:全秩参数高效微调与谱预处理
- Authors: Yequan Zhao, Ruijie Zhang, Liyan Tan, Niall Moran, Tong Qin, Zheng Zhang
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.22869
- Pdf link: https://arxiv.org/pdf/2605.22869
- Abstract
Both full fine-tuning (Full FT) and parameter-efficient fine-tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limited fine-tuning data can perturb robust pretrained features. We identify spectral preconditioning as the missing ingredient: reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme that outperforms unconstrained Full FT at the same trainable parameter count. Building on this insight, we propose FuRA (Full-Rank Adaptation), an efficient full-rank adaptation framework based on a block tensor-train factorization W = LSR, where the large core L is fixed to the pretrained block-wise SVD basis, while only the compact core R and the block-wise singular values S are optimized. This design simultaneously provides full-rank spectral preconditioning, preserves full-rank update expressivity, and achieves parameter, memory, and step-time efficiency comparable to LoRA. FuRA consistently outperforms Full FT across multiple settings, including LLM fine-tuning (+1.37 on LLaMA-3-8B commonsense reasoning), LLM reinforcement learning for mathematical reasoning, and visual instruction tuning for VLMs. Furthermore, the 4-bit quantized variant, QFuRA, also surpasses QLoRA. Code is available at this https URL
- 中文摘要
无论是全量微调(Full FT)还是参数高效的微调方法(如LoRA),在引入权重更新时都未考虑预训练时建立的谱结构。因此,来自有限微调数据的噪声梯度会扰动稳健的预训练特征。我们指出谱预处理是缺失的要素:通过全秩奇异值分解(SVD)重新参数化每个权重矩阵,并冻结一个奇异基,可将更新限制在预训练的列空间内,从而得到一个预处理优化方案,在相同可训练参数量下优于无约束的Full FT。基于这一见解,我们提出了FuRA(全秩适应),这是一种基于块张量链(tensor-train)分解W = LSR的高效全秩适应框架,其中大核心L固定为预训练的逐块SVD基,而仅优化紧致核心R和逐块奇异值S。该设计同时提供全秩谱预处理,保持全秩更新的表达能力,并实现与LoRA相当的参数、内存和单步时间效率。FuRA在多种设置下持续优于Full FT,包括LLM微调(在LLaMA-3-8B常识推理上+1.37)、面向数学推理的LLM强化学习,以及VLM的视觉指令调优。此外,4位量化变体QFuRA也超过了QLoRA。代码可在此 https URL 获取
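The FuRA abstract says that reparameterizing a weight matrix through its SVD and freezing one singular basis confines updates to the pretrained column space. The sketch below checks only that general claim on a small random matrix. It is not the block tensor-train factorization W = LSR from the paper, and the matrix sizes, perturbation scale, and all names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pretrained" weight. It is tall (6x4) so that its column
# space is a proper 4-dimensional subspace of R^6.
W = rng.normal(size=(6, 4))

# Thin SVD: W = U @ diag(s) @ Vt with U (6x4), s (4,), Vt (4x4).
U, s, Vt = np.linalg.svd(W, full_matrices=False)


def reconstruct(U, s, Vt):
    """Rebuild the weight matrix from its SVD factors."""
    return U @ np.diag(s) @ Vt


# Stand-in for a fine-tuning step: U stays frozen, while the singular
# values and the right factor receive arbitrary (here random) updates.
s_new = s + 0.1 * rng.normal(size=s.shape)
Vt_new = Vt + 0.1 * rng.normal(size=Vt.shape)
W_new = reconstruct(U, s_new, Vt_new)

# Projector onto the orthogonal complement of span(U).
P_perp = np.eye(W.shape[0]) - U @ U.T

# The weight change has no component outside the pretrained column space.
delta = W_new - W
residual = np.linalg.norm(P_perp @ delta)
print(f"norm of update outside span(U): {residual:.2e}")
```

Because every reachable weight has the form U times some matrix, the update stays in span(U) no matter how the trainable factors move, which is the constraint the abstract describes.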
NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic
NeuroNL2LTL:一种用于线性时间逻辑自然语言翻译的神经符号框架
- Authors: Paapa Kwesi Quansah, Ernest Bonnah
- Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
- Arxiv link: https://arxiv.org/abs/2605.22874
- Pdf link: https://arxiv.org/pdf/2605.22874
- Abstract
Effectively translating between natural language (NL) and formal logics like Linear Temporal Logic (LTL) requires expertise that limits formal verification's reach in safety-critical development. Template-based approaches sacrifice expressiveness for reliability; neural methods achieve fluency but provide no correctness guarantees. We present NeuroNL2LTL, a neurosymbolic architecture unifying learned translation with formal verification. NeuroNL2LTL routes translation through an intermediate representation whose mapping to LTL is structure-preserving by construction. Generated specifications undergo satisfiability and non-triviality checking; a minimal-edit repair mechanism corrects near-miss outputs before they reach downstream tools. The central innovation is verifier-in-the-loop training: verification outcomes serve as reward signals for reinforcement learning, producing neural components that optimize directly for formal correctness. On 200,000+ requirements spanning aerospace, robotics, autonomous vehicles, and ten additional domains, NeuroNL2LTL achieves 28% semantic equivalence with reference specifications while ensuring 86% of outputs are verified satisfiable. The system also generates contextually grounded explanations from LTL, enabling domain experts to validate specifications without specialized training. This work demonstrates that formal verification can function as both training objective and runtime filter for neural specification systems, allowing us to build neural-based tools whose reliability derives from logical guarantees rather than statistical confidence.
- 中文摘要
在自然语言(NL)与线性时间逻辑(LTL)等形式逻辑之间进行有效转换需要专业知识,这限制了形式验证在安全关键开发中的应用范围。基于模板的方法牺牲了表达能力以换取可靠性;神经方法能够实现流畅性,但不提供正确性保证。我们提出了NeuroNL2LTL,一种将学习式翻译与形式验证统一起来的神经符号架构。NeuroNL2LTL 通过一个中间表示进行翻译,该表示到 LTL 的映射在构造上即保持结构。生成的规范会经过可满足性和非平凡性检查;最小编辑修复机制会在接近正确的输出到达下游工具之前对其进行修正。核心创新是验证器在环训练:验证结果作为强化学习的奖励信号,从而得到直接针对形式正确性进行优化的神经组件。在涵盖航空航天、机器人、自动驾驶车辆及另外十个领域的20万余条需求上,NeuroNL2LTL与参考规范的语义等价率达到28%,同时确保86%的输出被验证为可满足。系统还能从LTL生成基于上下文的解释,使领域专家无需专门培训即可验证规范。这项工作表明,形式验证既可以作为神经规范系统的训练目标,也可以作为运行时过滤器,使我们能够构建可靠性源于逻辑保证而非统计置信度的神经工具。
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control
SCRIPT:可扩展扩散策略,采用多阶段训练,用于语言驱动的基于物理的人形控制
- Authors: Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu
- Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.22894
- Pdf link: https://arxiv.org/pdf/2605.22894
- Abstract
Controlling physics-based humanoids from natural-language instructions is a critical step toward general-purpose embodied agents. However, existing methods remain constrained by a tension between semantic expressiveness and physical feasibility, often failing to jointly achieve faithful instruction following, high-quality motion, and stable long-horizon control. We propose SCRIPT, a scalable diffusion policy with a multi-stage training framework for language-driven physics-based humanoid control. The core of SCRIPT is a Joint Action-State-Text Diffusion Transformer (JAST-DiT), which represents actions, physical states, and text as dedicated token streams and couples them through joint attention, enabling direct interaction between language semantics and control dynamics. To stabilize autoregressive control, we introduce a nonlinear history conditioning mechanism, which preserves the dense recent context and samples increasingly sparse cues from long-term history. Beyond supervised imitation pre-training, we propose a post-training stage, further improving the performance using Reinforcement Learning with Hybrid Rewards (RLHR). By injecting learnable noise into the flow-sampling process, RLHR effectively improves motion quality and instruction following within closed-loop simulations using hybrid physical feedback and text rewards. Quantitative evaluations demonstrate that SCRIPT outperforms prior state-of-the-art methods, with gains across text alignment, motion quality, and physical realism metrics. Furthermore, scaling studies on the 1200-hour MotionMillion dataset demonstrate consistent performance gains with model scaling, highlighting SCRIPT's robust scalability for large-scale pre-training. Our code will be publicly available for future research.
- 中文摘要
通过自然语言指令控制基于物理的人形生物,是迈向通用具身代理的关键一步。然而,现有方法仍受限于语义表达性和物理可行性的张力,常常未能共同实现忠实的跟随指令、高质量的运动和稳定的长视野控制。我们提出了SCRIPT,一种可扩展的扩散策略,采用多阶段训练框架,用于基于语言驱动的物理类人控制。SCRIPT的核心是一个联合动作-状态-文本扩散变换器(JAST-DiT),它将动作、物理状态和文本表示为专用的令牌流,并通过联合注意力将它们耦合,实现语言语义与控制动态之间的直接交互。为稳定自回归控制,我们引入了非线性历史条件机制,保留密集的近期背景,并采样越来越稀疏的长期历史线索。除了监督模仿的预训练外,我们还提出了一个训练后阶段,通过混合奖励强化学习(RLHR)进一步提升表现。通过向流采样过程注入可学习噪声,RLHR有效提升闭环模拟中的运动质量和指令跟随,采用混合物理反馈和文本奖励。定量评估显示,SCRIPT在文本对齐、运动质量和物理真实性指标方面均优于以往最先进方法。此外,基于1200小时MotionMillion数据集的缩放研究显示,模型缩放持续带来性能提升,凸显了SCRIPT在大规模预训练中的强大可扩展性。我们的代码将公开供未来研究使用。
PIMbot: A Self-Adaptive Attack Framework for Adversarial Manipulation of Multi-Robot Reinforcement Learning
PIMbot:一种用于对抗性操作多机器人强化学习的自适应攻击框架
- Authors: Zexin Li, Ziliang Zhang, Hyoseung Kim, Cong Liu
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.23027
- Pdf link: https://arxiv.org/pdf/2605.23027
- Abstract
Recent research has demonstrated the potential of reinforcement learning in effective multi-robot collaboration, particularly in social dilemmas where robots face a trade-off between self-interest and collective benefits. However, environmental factors such as miscommunication and adversarial robots can impact cooperation, making it crucial to explore how multi-robot communication can be manipulated to achieve different outcomes. This paper presents PIMbot, a framework that manipulates outcomes via two complementary levers: (i) incentive manipulation of the reward channel and (ii) policy manipulation of an agent's own actions. An adaptive multi-objective controller balances these levers in an online manner. Our work introduces a novel approach to manipulation in recent multi-agent RL social dilemmas that utilize a unique reward function for incentivization. By utilizing our proposed PIMbot mechanisms, a robot is able to manipulate the social dilemma environment effectively. Comprehensive experimental results demonstrate the effectiveness of our proposed methods in the Gazebo-simulated multi-robot environment. Moreover, a real embedded device case study on NVIDIA Jetson Orin Nano quantifies system cost and validates PIMbot's effectiveness on realistic autonomous embedded systems scenarios beyond simulation. Together, these results position PIMbot as a rigorous stress-test tool exposing critical vulnerabilities in multi-robot cooperative tasks.
- 中文摘要
最新研究表明了强化学习在高效多机器人协作中的潜力,尤其是在机器人面临自身利益与集体利益权衡的社会困境中。然而,通信失误和对抗性机器人等环境因素会影响合作,因此探索如何操控多机器人通信以实现不同结果至关重要。本文提出了PIMbot,这一框架通过两个互补的杠杆操控结果:(i)对奖励通道的激励操纵,(ii)对智能体自身动作的策略操纵。自适应多目标控制器以在线方式平衡这两个杠杆。我们的研究针对近期利用独特奖励函数进行激励的多智能体强化学习社会困境,引入了一种新的操纵方法。借助我们提出的PIMbot机制,机器人能够有效操控社会困境环境。全面的实验结果证明了所提方法在Gazebo仿真多机器人环境中的有效性。此外,一项基于NVIDIA Jetson Orin Nano的真实嵌入式设备案例研究量化了系统成本,并验证了PIMbot在仿真之外的真实自主嵌入式系统场景中的有效性。综合来看,这些结果使PIMbot成为一个严谨的压力测试工具,能够揭示多机器人协作任务中的关键漏洞。
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
训练数据教给强化学习记忆智能体什么:记忆增强问答中课程效应的实证研究
- Authors: Xinjie He, Zhiyuan Lin, Su Liu, Jialun Wu, Qiyang Xie, Weikai Zhou, Shuai Xiao
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2605.23067
- Pdf link: https://arxiv.org/pdf/2605.23067
- Abstract
Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
- 中文摘要
强化学习(RL)已成为训练LLM代理在多会话对话中通过外部记忆库推理的可行配方。现有工作仅基于单一基准进行训练,未明确训练数据的组成如何影响记忆代理所获得的技能。我们提出了一项受控实证研究,框架、强化学习算法和所有超参数固定,仅在三个条件下调整训练课程:域内(LoCoMo)、混合基准(LoCoMo + LongMemEval)和域外(仅LongMemEval)。在两个基准测试和十种题型中,课程构成更像是对专业化的细致杠杆,而非绩效的统一尺度。混合课程在两个评估集中都获得了最强的整体F1分数。在狭窄的域外集合上训练,尽管整体表现较弱,仍传递了一项目标技能——时间推理。各类型差异远大于总差异,表明单一数字基准比较系统性地低报了课程效果。我们还报告了将GRPO适应到单GPU环境下的两个实际经验:跨基准混合需要从内存库中过滤格式特定的噪声以保留训练信号,而二元精确匹配奖励在单个GPU所需的小组规模(G = 4)下不产生学习信号,从而激励了该模式下的连续奖励函数。
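The abstract reports that a binary exact-match reward yields no learning signal at group size G = 4. One way to see why is the group-normalized advantage commonly used in GRPO: if every rollout in a group receives the same reward, all advantages are zero. The sketch below assumes the standard mean/std normalization and independent rollouts with a hypothetical per-rollout success probability `p`. None of these numbers come from the paper.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def prob_no_signal(p, G):
    """Probability that G independent binary rewards are all equal,
    so that every advantage in the group is zero."""
    return p ** G + (1 - p) ** G


# All rollouts fail: the group carries no gradient signal.
print(group_advantages([0, 0, 0, 0]))

# A continuous reward (e.g. a partial-credit score) still separates rollouts.
print(group_advantages([0.1, 0.4, 0.2, 0.0]))

# With a hypothetical 5% exact-match rate, most groups of 4 are uninformative.
print(prob_no_signal(0.05, 4))
```

Under these assumptions, the fraction of wasted groups shrinks as G grows, which is consistent with the abstract's motivation for continuous rewards when G must stay small.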
Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics
顺畅做梦,高效采样,采用梯度惩罚潜在动力学
- Authors: Romil V. Sonigra (1), P. R. Kumar (1) ((1) Texas A&M University)
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23089
- Pdf link: https://arxiv.org/pdf/2605.23089
- Abstract
Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at this http URL.
- 中文摘要
基于模型的强化学习通过学习世界模型提升样本效率。然而,现有的潜在世界模型如 DreamerV3 并未明确强制其学习的转移动力学局部光滑性,导致迁移动力学学习中存在有用的归纳偏差未被充分利用。我们提出了 GPLD,这是一种为 DreamerV3 设计的梯度惩罚潜动力学正则化器,它对后验潜在分布施加逐行雅可比惩罚,以促进局部平滑过渡学习。我们证明,这一惩罚可以被解释为离散嵌入态MDP中过渡律的连续-潜在平滑,并利用哈钦森式随机探针高效估计。在DeepMind Control的本体感觉任务中,GPLD提升了整体样本效率,尤其在高复杂度的运动环境中提升显著。在更具挑战性的四足任务中,GPLD更早达到高回报行为,并在更长时间内表现出更稳定的晚期学习。显式局部光滑性正则化是一种简单且有效的方法,用于改进平滑连续控制环境中的潜在世界模型。GPLD代码可在此http URL获取。
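The abstract mentions estimating a Jacobian penalty with Hutchinson-style stochastic probes. The generic idea is that for random probes v with identity covariance, the expectation of ||Jv||^2 equals the squared Frobenius norm of J, so the penalty can be estimated from Jacobian-vector products without forming J. The sketch below uses an explicit matrix as a stand-in for the Jacobian. It is not GPLD's row-wise penalty on the posterior latent distribution, and the function name and sizes are hypothetical.

```python
import numpy as np


def hutchinson_jacobian_penalty(jvp, dim, num_probes, rng):
    """Estimate ||J||_F^2 using only Jacobian-vector products.

    jvp: callable mapping a probe v of shape (dim,) to J @ v.
    Probes are Rademacher vectors (entries +1 or -1), which have
    identity covariance, so E[||J v||^2] = trace(J^T J) = ||J||_F^2.
    """
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += np.sum(jvp(v) ** 2)
    return total / num_probes


rng = np.random.default_rng(0)

# Stand-in Jacobian of a hypothetical latent transition (5 outputs, 3 inputs).
J = rng.normal(size=(5, 3))

estimate = hutchinson_jacobian_penalty(lambda v: J @ v, dim=3, num_probes=20000, rng=rng)
exact = np.sum(J ** 2)
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

In an autodiff framework the `jvp` callable would come from forward-mode differentiation of the learned dynamics, so each probe costs roughly one extra forward pass.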
Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
次贝叶斯强化学习代理在最坏情况下的鲁棒性方面优于经典强化学习
- Authors: Manish Aryal, Faiyaz Azam, Agnivo Banerjee, Sai Sidhanth Manoharan Jayanthi, Allegra Laro, Clément Legentilhomme, Andrew Lin, Florian Lorkowski, Radman Rakhshandehroo, Patric Rommel, Emanuel Ruzak, Nathan Theng, Paul Yushin Rapoport
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23146
- Pdf link: https://arxiv.org/pdf/2605.23146
- Abstract
Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate the agent's behavior, including environments crucial to AI safety, where the agent interacts with predictors, humans, other AI agents, and institutions. In such settings, the agent's model class fails to capture the world in which it operates. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain. Infra-Bayesianism is a decision-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty, where priors can be reasonably chosen, from Knightian uncertainty, where no grounds exist for the construction of such a prior. It does so by evaluating actions on their worst-case outcomes, rather than from posterior expectations or weighted averaging. We present the first proof-of-concept implementation of an infra-Bayesian reinforcement learning architecture for finite-outcome stateless decision problems. Our agent maintains a set of imprecise hypotheses, updates them using infra-Bayesian conditioning, and selects actions by maximizing worst-case expected value. We apply this implementation of the infra-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst-case regret as compared to classical reinforcement learning agents. We also investigate Newcomb's problem and show that the infra-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy-dependent uncertainty.
- 中文摘要
经典强化学习假设智能体与一个固定环境互动,其行为不依赖于智能体策略。这一假设在不可实现的环境中失效,即其他行为者可能预期代理行为的环境,包括对人工智能安全至关重要的环境,代理与预测者、人类、其他人工智能代理和机构互动。在这种情况下,代理的模型类无法捕捉其运行的实际世界。在这种错误描述下,经典贝叶斯方法可能产生自信地错误的后验、不可靠的决策和无界的遗憾,因为实现性无法实现。下贝叶斯主义是一种决策理论框架,通过区分普通概率不确定性(其中可合理选择先验)与奈特不确定性(无理由构造此类先验)来解决这些失败。它通过评估行为的最坏结果,而非基于事后预期或加权平均来实现这一点。我们首次提出了针对有限结果无状态决策问题的贝叶斯下强化学习架构的概念验证实现。我们的智能体维护一组不精确的假设,利用贝叶斯下条件进行更新,并通过最大化最坏情况期望值来选择动作。我们将贝叶斯下极大值决策过程的实现应用于具有奈特不确定性的环境,并展示了与经典强化学习代理相比,最坏情况下的遗憾率更低。我们还研究了纽康布问题,并证明下贝叶斯代理选择最优策略,优于经典决策理论代理。我们的结果为能够在模型错误指定和策略依赖不确定性下保持稳健的强化学习代理迈出了一步。
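The decision rule described in the abstract, selecting actions by maximizing worst-case expected value over a set of hypotheses, can be contrasted with averaging under a prior. The sketch below covers only that selection step for a stateless problem. It omits infra-Bayesian conditioning entirely, and the utility table and uniform prior are made-up numbers for illustration.

```python
import numpy as np

# expected_utility[h, a]: expected utility of action a under hypothesis h.
# Three hypothetical hypotheses, two actions.
expected_utility = np.array([
    [1.0, 0.6],
    [0.0, 0.5],
    [0.9, 0.4],
])


def maximin_action(eu):
    """Pick the action whose worst-case expected utility is largest."""
    worst_case = eu.min(axis=0)
    return int(np.argmax(worst_case))


def bayes_action(eu, prior):
    """Pick the action with the largest prior-weighted expected utility."""
    return int(np.argmax(prior @ eu))


uniform_prior = np.full(expected_utility.shape[0], 1.0 / 3.0)

print("maximin picks action", maximin_action(expected_utility))
print("prior-averaging picks action", bayes_action(expected_utility, uniform_prior))
```

Here action 0 looks better on average but can score 0.0 under one hypothesis, while action 1 never falls below 0.4, so the two rules disagree.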
Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback
带赌博机反馈的强化学习中面向良好策略的纯探索
- Authors: Zitian Li, Wang Chi Cheung
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.23182
- Pdf link: https://arxiv.org/pdf/2605.23182
- Abstract
Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a "good enough" policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $\mu_0$, GPI only requires identifying a policy with expected reward in an episode at least $\mu_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability $\geq 1-\delta$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log 1/\delta$ in our upper bound is $O(H^2/(V^* - \mu_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* - \mu_0)^2$ term. Numerical experiments further validate the efficiency of our approach.
- 中文摘要
回合制强化学习中的纯探索主要聚焦于最佳策略识别(BPI),即以高置信度识别(近)最优策略。受“足够好”的策略即可满足需求的实际场景启发,我们研究了另一种目标:良好策略识别(GPI)。对于给定的奖励阈值$\mu_0$,若存在单回合期望奖励至少为$\mu_0$的策略(正实例),GPI只需识别出这样一个策略;若不存在此类策略(负实例),则声明 None。我们在固定置信度设置下形式化GPI:要求输出以至少$1-\delta$的概率正确,并力求最小化期望样本复杂度,即得出输出所探索的回合数的期望。我们提出了一种新颖的算法BEE-GPI,并针对正实例和负实例推导出有理论依据的样本复杂度上界。值得注意的是,对于正实例,我们上界中$\log 1/\delta$的系数为$O(H^2/(V^* - \mu_0)^2)$,其中$H$为回合长度,$V^*$为单回合内的最优期望奖励。除此之外,该系数不依赖于动作空间和状态空间的大小,这与 BPI 中的样本复杂度形成鲜明对比。我们进一步建立下界结果,以展示BEE-GPI的近似最优性以及$1/(V^* - \mu_0)^2$项的必要性。数值实验进一步验证了我们方法的效率。
Convex Optimization for Alignment and Preference Learning on a Single GPU
单GPU上的凸优化用于对齐和偏好学习
- Authors: Miria Feng, Mert Pilanci
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.23244
- Pdf link: https://arxiv.org/pdf/2605.23244
- Abstract
Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, high dependence on GPU resources, and expensive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and obtains significant reduction in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across four datasets--including a 26621-sample synthetic Educational Feedback dataset--and six models (including Llama-3.1-8B) demonstrate COALA's competitive performance and efficiency while utilizing as little as ~17.6% of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly shorter time in comparison to traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.
- 中文摘要
微调大型语言模型(LLMs)以符合人类偏好,推动了Gemini和ChatGPT等系统的成功。然而,像基于人类反馈的强化学习(RLHF)这样的方法仍然计算成本高且结构复杂。直接偏好优化(DPO)提供了一种更简单的替代方案,但存在诸如排序准确性不一致、对GPU资源高度依赖以及超参数调优代价高等局限性。我们提出了凸优化对齐与偏好学习算法(COALA):一种具有强大理论保证的新型轻量级策略。通过利用神经网络的凸优化重构,COALA消除了对参考模型的需求,显著减少了训练时间和显存消耗,从而实现了在单个GPU上的高效训练。在四个数据集(包括一个包含26621个样本的合成教育反馈数据集)和六个模型(包括Llama-3.1-8B)上的实验表明,COALA在最低仅使用DPO总TFLOPs的约17.6%的情况下,仍具备有竞争力的性能和效率。与DPO和ORPO等传统方法相比,COALA的奖励稳定且单调递增,并在显著更短的时间内达到峰值间隔(margin)。据我们所知,这是首次将凸优化有效应用于LLM的偏好微调。
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
EvalVerse:专业电影视频生成的流水线感知和专家校准基准
- Authors: Songlin Yang, Haobin Zhong, Ruilin Zhang, Xiaotong Zhao, Shuai Li, Kai Zheng, Xuyi Yang, Zhe Wang, Zhenchen Tang, Yang Li, Bohai Gu, Zhengwei Peng, Yidan Huang, Mengzhou Luo, Yihang Bo, Dalu Feng, Yujia Zhang, Juntao Ma, Ruiqi Wang, Lvmin Zhang, Yuwei Guo, Frank Guan, Maneesh Agrawala, Hongbo Fu, Alan Zhao, Anyi Rao
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23271
- Pdf link: https://arxiv.org/pdf/2605.23271
- Abstract
The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.
- 中文摘要
生成视频基础模型的快速演进推动了该领域向专业级电影综合迈进。为了实现如此苛刻的质量,社区转向强化学习(RL)和代理式工作流。然而,可靠的评估已成为关键瓶颈。现有的基准主要评估“是否正确”(基本的提示遵循),而根本忽视了“是否好”(电影质量、表演和美学)。此外,当前自动化指标缺乏提供可信信号所需的领域特定严谨性,导致人类审美感知与机器评分之间存在严重的可信度差距。为弥合这一差距,我们引入了EvalVerse,一个全面、关注流水线且由专家校准的评估框架。我们不仅将视频生成评估视为工程任务,更视为核心科学问题:主观电影专业的系统数字化。首先,我们将领域知识组织成与专业电影制作工作流程(前期制作、制作和后期制作)相匹配的评估分类法。其次,我们将人类专家的判断提炼成一个经过策划的数据集,并配有大规模的人类注释。第三,我们将这些知识注入视觉语言模型(VLM),通过专家校准的微调策略,使VLM能够进行显式的思维链推理。与以往作品相比,EvalVerse不仅保持了基础性的“正确性”指标兼容性,还显著扩展了“良好”标准,并将任务覆盖范围扩大到复杂的多镜头序列和视听集成。因此,通过提供细致的诊断信号,EvalVerse超越了静态排行榜,建立了未来工作的基本基础设施,如奖励模型和评估代理。
Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints
具有同配性约束的微正则图系综强化学习
- Authors: Hoyun Choi, Junghyo Jo, Deok-Sun Lee
- Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23285
- Pdf link: https://arxiv.org/pdf/2605.23285
- Abstract
How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approaches, formulated as exponential random graph models (ERGMs), enforce constraints only in expectation, allowing individual realizations to fluctuate around the target. Conversely, microcanonical ensembles impose hard constraints exactly, but practical sampling methods beyond fixing the degree sequence have remained out of reach. Here we introduce the Deep Microcanonical Graph Generator (DMGG), a reinforcement learning (RL) framework that transforms any given graph through degree-preserving rewirings to exactly reach a prescribed assortativity, which characterizes the degree-degree correlation of adjacent nodes. Instead of relying on the entropically dominated Metropolis-Hastings dynamics of the ERGM, DMGG employs a policy-guided search that maximally alters the joint-degree matrix. This eliminates exhaustive parameter tuning and accelerates generation by at least an order of magnitude while preserving configurational diversity. As DMGG generalizes across various graph sizes, sparsities, and topologies, it provides exact null models that allow for the quantitative isolation of secondary observables, such as the clustering coefficient. These results establish RL as a practical and powerful paradigm for generating hard-constrained graphs, opening avenues to investigate structure-function relationships free from ensemble artifacts.
- 中文摘要
网络结构如何决定功能是一个基本问题,且可以通过具有精确控制结构属性的图系综来研究。典型方法,称为指数随机图模型(ERGM),仅在期望值下强制约束,允许单个实现围绕目标波动。相反,微正则系综施加严格约束,但除了固定度数序列外的实际采样方法仍然难以实现。这里我们介绍深度微正则图生成器(DMGG),这是一种强化学习(RL)框架,通过保持度数的重布线将任意给定的图转换为精确达到规定的排序性,该排序表征相邻节点的度-相关性。DMGG没有依赖ERGM中熵支配的Metropolis-Hastings动态,而是采用策略引导搜索,最大化改变联合度矩阵。这消除了穷尽参数调优,并在保持构型多样性的同时,至少加快了一个数量级的生成速度。随着DMGG在不同图大小、稀疏度和拓扑结构上的推广,它提供了精确的零模型,使次级可观测量(如聚类系数)能够定量隔离。这些结果确立了强化学习作为生成硬约束图的实用且强大的范式,为研究无集合伪影的结构-功能关系打开了大门。
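The degree-preserving rewiring that the DMGG abstract builds on can be illustrated with the classic double edge swap: edges (a, b) and (c, d) are replaced by (a, d) and (c, b), which leaves every node's degree unchanged. The sketch below picks swaps uniformly at random. In the paper a learned policy guides the choice, which is not modeled here. The helper names and the 6-cycle example are assumptions.

```python
import random
from collections import Counter


def norm(u, v):
    """Store undirected edges as ordered tuples."""
    return (u, v) if u < v else (v, u)


def degrees(edges):
    """Degree of every node in an undirected edge set."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def double_edge_swap(edges, rng, max_tries=1000):
    """Return a new edge set after one degree-preserving swap.

    Rejects swaps that would create a self-loop or a duplicate edge,
    so the graph stays simple. Returns a copy if no swap is found.
    """
    edge_list = sorted(edges)
    for _ in range(max_tries):
        (a, b), (c, d) = rng.sample(edge_list, 2)
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint would create a self-loop or no-op
        new1, new2 = norm(a, d), norm(c, b)
        if new1 in edges or new2 in edges:
            continue  # would create a multi-edge
        return (edges - {(a, b), (c, d)}) | {new1, new2}
    return set(edges)


rng = random.Random(0)
graph = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)}  # a 6-cycle
rewired = double_edge_swap(graph, rng)
print(sorted(rewired))
print(degrees(rewired) == degrees(graph))
```

Since each swap changes which degrees are adjacent but not the degree sequence itself, repeated swaps can move the assortativity while the hard degree constraint holds exactly.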
Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning
基于上下文赌博机偏好学习的人在环路多智能体呼吸机决策支持
- Authors: Sijia Li, Xiaoyu Tan, Qixing Wang, Weiyi Zhao, Chen Zhan, Teqi Hao, Xuemin Wang, Lei Gu, Roland Eils, Xihe Qiu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23320
- Pdf link: https://arxiv.org/pdf/2605.23320
- Abstract
Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety boundaries and clinician-specific tuning styles. Rule-based approaches rarely generalize to personalization, and end-to-end reinforcement learning or single large language model systems remain difficult to control and audit. We propose the Ventilator Decision Support System (VDSS), a human-in-the-loop multi-agent framework that coordinates modular decision components through contract-driven structured interfaces and produces traceable evidence for review. VDSS performs online preference adaptation with a contextual bandit, updating clinician-specific preferences from the final accepted decision at each adjustment cycle and using them to guide subsequent recommendations. Structured rejection feedback triggers targeted replanning to reduce unproductive iterations and improve interaction stability. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds to reach an acceptable plan, supporting clinically deployable human-AI collaboration.
- 中文摘要
呼吸机决策支持需要序贯决策,跟踪不断变化的生理状态和疾病轨迹,同时遵守安全边界并尊重临床医生各自的调节风格。基于规则的方法很少能推广到个性化场景,而端到端强化学习或单一大型语言模型系统仍然难以控制和审计。我们提出了呼吸机决策支持系统(VDSS),这是一个人在环路的多智能体框架,通过契约驱动的结构化接口协调模块化决策组件,并产生可追溯的证据供审查。VDSS通过上下文赌博机进行在线偏好适应,在每个调整周期根据最终被接受的决策更新临床医生特定的偏好,并以此指导后续建议。结构化的拒绝反馈会触发有针对性的重新规划,以减少无效迭代并提升交互稳定性。结合专家评审的回顾性ICU轨迹回放表明,推荐的可接受度更高,达成可接受方案所需的交互轮次更少,从而支持可临床部署的人机协作。
Score-Based One-step MeanFlow Policy Optimization
基于评分的一步均值流策略优化
- Authors: Kyungyoon Kim, Donghyeon Ki, Hee-Jun Ahn, Byung-Jun Lee
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23365
- Pdf link: https://arxiv.org/pdf/2605.23365
- Abstract
Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.
- 中文摘要
扩散和流匹配已成为强化学习中的表达策略类,但它们依赖多步去噪在推理时带来了巨大的计算开销,这在在线强化学习中尤为棘手。MeanFlow提供了一个有前景的替代方案,通过学习平均速度场,将噪声映射为单次网络评估的数据。然而,平均流通常需要来自目标分布的样本来构建其目标速度场,而这些样本在在线强化学习中不可得。我们提出了基于分数的一步均值流策略优化(SOM),这是一种演员-批判者算法,通过分数估计和概率流常微分方程,直接从Q函数构建目标速度场,从而将概率质量集中在高价值模式中。在完全在线的强化学习环境中,SOM通过单代步骤实现了最先进的运动任务性能,同时相比之前基于扩散和流量匹配的策略,显著缩短了训练和推断时间。
Curriculum reinforcement learning with measurable task representation learning
课程强化学习与可测量任务表示学习
- Authors: Yongyan Wen, Siyuan Li, Mingjian Fu, Yiqin Yang, Xun Wang, Peng Liu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23372
- Pdf link: https://arxiv.org/pdf/2605.23372
- Abstract
In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process aims to use the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Within the rich CRL literature, the interpolation-based paradigm is a major line of work: it automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in a task space equipped with a meaningful distance metric (i.e., one that can measure task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex tasks, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure similarity, we propose to transform the task space into a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we obtain a latent task representation with a task similarity measurement property, in which two close task embeddings correspond to two tasks that are similar in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme, which can effectively generate new tasks that are increasingly similar to the target task. We evaluate our method in a variety of challenging navigation tasks, and the experimental results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.
- 中文摘要
在课程强化学习(CRL)中,代理通过一系列任务(即课程)逐步积累知识,学习过程旨在利用积累的知识最终解决具有挑战性的目标任务。早期CRL的工作侧重于候选任务的排序,但近期研究探讨了自动课程生成。在丰富的CRL文献中,基于插值的CRL范式是一个主体,通过插值初始任务分布与目标任务分布在任务空间中,并具有有意义的距离度量(即可以衡量任务相似性)自动生成中间任务。然而,在具有挑战性的导航任务中,非欧几里得上下文(任务)空间否定了这一假设。为了实现复杂任务的自动课程生成,我们提出了一种基于可测量任务表示学习的新颖自动课程生成方法。为了更好地测量相似性,我们建议将任务空间转换为潜在空间。通过变分自编码结构编码奖励和状态转移,我们实现了具有任务相似度测量性质的潜在任务表示,两个紧密任务嵌入对应两个相似任务,分别在奖励和状态转移方面。基于已学到的任务表示,我们进一步开发了自动课程生成方案,能够有效地生成越来越接近目标任务的新任务。我们在多种具有挑战性的导航任务中评估了该方法,实验结果表明,所提方法优于基于插值和生成对抗网络的先进CRL方法。
From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning
从正确到偏好:个性化能动强化学习框架
- Authors: Ranxu Zhang, Zeyang Li, Jiacheng Huang, Rui Zhang, Xiaozhou Xu, Sun Zhe, Yanyong Zhang, Chao Wang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2605.23382
- Pdf link: https://arxiv.org/pdf/2605.23382
- Abstract
Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.
- 中文摘要
智能体强化学习(Agentic Reinforcement Learning,简称Agentic RL)在具有明确成功信号的任务中取得了显著进展。然而,许多现实中的智能体应用需要以用户为条件的行为:同一查询在不同用户之间可能需要不同的规划策略和工具使用决策。这一设定带来了关键挑战:通用奖励无法捕捉异质的用户偏好,观察到的行为与从众效应纠缠在一起,扁平记忆无法支持个性化技能检索。为此,我们提出了一个统一的个性化Agentic RL框架,将个性化嵌入训练阶段的优化中。其核心是个性化锚点奖励解耦策略优化(PARPO),它将通用任务质量奖励与个性化偏好奖励解耦,并利用用户特定锚点在异构奖励尺度下稳定学习。我们还进一步引入了两阶段偏好解缠奖励模型和偏好对齐技能演化图记忆(PSGM),用于个性化监督和偏好对齐的技能检索。它们共同形成了一个偏好识别、策略优化和结构化技能积累的闭环。在ETAPP、ETAPP-Hard和SJAgent上的实验表明,我们的框架始终优于强大的记忆基线和强化学习基线。代码和数据包含在补充材料中。
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
作为奖励的元认知:通过知识和调节信号强化LLM推理
- Authors: Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23384
- Pdf link: https://arxiv.org/pdf/2605.23384
- Abstract
Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.
- 中文摘要
近期强化学习方法显著提升了大型语言模型的推理能力。现有的奖励设计主要遵循两种范式:(1)带可验证奖励的强化学习(RLVR)从可执行检查或真实答案中提取结果信号,但对中间推理行为提供有限指导。(2)评分标准作为奖励(RaR)超越最终答案检查,使用自然语言评分标准评估推理质量和任务合规性,但通常需要针对实例的评分标准和大量设计工作。为解决这些问题,我们引入了元认知即奖励(MaR),这是一个受元认知启发的强化学习框架,通过两个一般过程维度指导LLM推理:i)元认知知识,识别任务相关信息,无需手工定制的实例特定评分标准;ii)元认知调节,规划和调整推理过程,提供超越最终答案的奖励指导。MaR将模型推广框架化为显式元认知组件,并通过轨迹级奖励优化任务知识覆盖率、调控准确度和最终答案正确性。通过这种方式,MaR将奖励反馈扩展到推理轨迹,同时将奖励信号置于一般元认知维度。在22个基准测试上的实验显示,MaR持续提升模型性能,较基础模型提升最高7.7%,较原版DAPO提升11.0%。值得注意的是,Qwen3.5-9B + MaR缩小了与前沿模型的差距,整体平均超过GPT-OSS-120B,并在多个基准测试中优于更强模型。过程层面分析进一步显示,过程质量推理有显著提升。MaR也推广到域外数据集,其中MaR训练的模型平均优于其对应的基础模型。
Droneulator: A Portable UAV Simulator for Agricultural Workflows with RotorPy and Godot 4
Droneulator:一款基于RotorPy和Godot 4、面向农业工作流程的便携式无人机模拟器
- Authors: Jacob Swindell, Michael Lowen, Marija Popovic, Riccardo Polvara
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.23386
- Pdf link: https://arxiv.org/pdf/2605.23386
- Abstract
Agricultural UAV research requires simulators that integrate realistic 3D scenes, high-fidelity vehicle dynamics, and robotics middleware, while remaining practical to deploy across heterogeneous development machines. We present Droneulator, a portable UAV simulator architecture that combines RotorPy for multirotor dynamics with Godot 4 for rendering and sensor generation. Droneulator exposes both PX4-based control and a lightweight WebSocket command path, and publishes synchronised visual and state streams through a Zenoh-based, ROS 2-compatible pipeline. This integration enables a single stack to support inspection-oriented data capture, ROS 2/PX4 local planning, and reinforcement learning experiments without modifying the simulator infrastructure. We present quantified validation of the current system across three agricultural UAV workflows: tree-scale image collection for 3D reconstruction with COLMAP, local planning around canopy obstacles using EGO-Planner, and closed-loop reinforcement learning through a custom Gymnasium environment. In the reported setup, the results show that the simulator can sustain low-latency sensing, support reconstruction-oriented data collection under varying capture density, execute collision-free local planning around canopy obstacles, and support stable depth-sensing-based policy training for obstacle-aware navigation. Together, these results show the potential of Droneulator for agricultural UAV inspection, planning, and learning within one deployable stack.
- 中文摘要
农业无人机研究需要集成真实3D场景、高保真飞行器动力学和机器人中间件的模拟器,同时还要便于在异构开发机器上部署。我们介绍Droneulator,一种便携式无人机模拟器架构,结合了用于多旋翼动力学的RotorPy和用于渲染与传感器生成的Godot 4。Droneulator 既提供基于 PX4 的控制,也提供轻量级 WebSocket 命令路径,并通过基于 Zenoh、兼容 ROS 2 的流水线发布同步的视觉流和状态流。这种集成使单一软件栈能够支持面向巡检的数据采集、ROS 2/PX4 局部规划和强化学习实验,而无需修改模拟器基础设施。我们针对三种农业无人机工作流程给出了当前系统的定量验证:使用COLMAP进行三维重建的树木尺度图像采集,使用EGO-Planner绕树冠障碍进行局部规划,以及通过自定义 Gymnasium 环境进行闭环强化学习。在所报告的设置中,结果显示模拟器能够维持低延迟的传感,支持在不同采集密度下面向重建的数据收集,执行绕树冠障碍的无碰撞局部规划,并支持基于深度感知的稳定策略训练以实现障碍物感知导航。这些结果共同展示了Droneulator在单一可部署软件栈内支持农业无人机巡检、规划和学习的潜力。
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
Reflex:基于状态的连续控制中利用反射对称性的强化学习
- Authors: Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23415
- Pdf link: https://arxiv.org/pdf/2605.23415
- Abstract
Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction have primarily focused on image-based RL and rotational symmetry such as $\mathrm{SO(2)}$, leaving state-based RL and reflection symmetry largely underexplored. In this work, we focus on state-based continuous control tasks and exploit reflection symmetry by introducing Reflex, a paradigm that seamlessly integrates with both on-policy and off-policy RL algorithms. We formalize two types of reflection (axial reflection and bilateral reflection) and characterize their corresponding transformations. Building on a theoretical analysis of symmetry-preserving optimal value functions and policies, Reflex integrates reflection symmetry into policy learning through principled symmetry regularization mechanisms. We integrate Reflex with PPO and SAC, and evaluate it on a suite of OpenAI Gym and DeepMind Control benchmarks, demonstrating superior performance over standard baselines while improving sample efficiency. Our code is available at this https URL.
- 中文摘要
强化学习长期以来一直面临样本效率较差的问题。缓解这一问题的一个有前景的方法是利用群不变马尔可夫决策过程($G$-不变MDP)。该方向的现有研究主要聚焦于基于图像的强化学习和旋转对称性(如 $\mathrm{SO(2)}$),而基于状态的强化学习和反射对称性在很大程度上尚未得到充分探索。在本研究中,我们聚焦于基于状态的连续控制任务,并通过引入Reflex来利用反射对称性,这一范式能够与同策略(on-policy)和异策略(off-policy)强化学习算法无缝集成。我们形式化了两种反射类型(轴向反射和双侧反射),并刻画了它们对应的变换。基于对保持对称性的最优价值函数和策略的理论分析,Reflex通过有原则的对称性正则化机制将反射对称性整合进策略学习。我们将Reflex与PPO和SAC集成,并在一组OpenAI Gym和DeepMind Control基准测试上进行评估,结果显示其性能优于标准基线,同时提升了样本效率。我们的代码可在此 https URL 访问。
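The Reflex abstract does not spell out its regularizer, so the following is only a minimal sketch of one plausible symmetry penalty, assuming a deterministic policy and reflections given as matrices. The names `reflection_regularizer`, `state_reflect`, and `action_reflect` are hypothetical. The penalty is zero exactly when acting on a mirrored state gives the mirrored action.

```python
import numpy as np

def reflection_regularizer(policy, states, state_reflect, action_reflect):
    """Mean squared gap between policy(M_s s) and M_a policy(s).

    policy         : callable mapping an (N, ds) batch to an (N, da) batch (assumed deterministic)
    state_reflect  : (ds, ds) matrix M_s that mirrors a state (hypothetical input)
    action_reflect : (da, da) matrix M_a that mirrors an action (hypothetical input)
    """
    mirrored_actions = policy(states @ state_reflect.T)   # act on mirrored states
    expected = policy(states) @ action_reflect.T          # mirror the original actions
    return float(np.mean(np.sum((mirrored_actions - expected) ** 2, axis=1)))
```

In a PPO or SAC training loop, a term like this would typically be added to the policy loss with a small coefficient; that weighting is an assumption here, not a detail from the abstract.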
ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning
ARES:可扩展大型语言模型强化学习的自动化评分标准综合
- Authors: Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang, Wenjie Wang, Fuli Feng, Dayiheng Liu
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2605.23454
- Pdf link: https://arxiv.org/pdf/2605.23454
- Abstract
Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manually constructed question sets, while fixed task-level rubrics may fail to capture the evaluation requirements of individual questions. We propose ARES (Automated Rubric synthEsis for Scalable RL), a framework for automatically constructing rubric-based RL data at scale. Starting from raw pretraining documents, ARES converts source knowledge into self-contained question-answer pairs and co-generates question-specific weighted rubrics, enabling instance-level reward supervision for open-ended responses. To improve diversity and quality, ARES conditions generation on domain labels and persona information, and applies validation filters for question self-containment, answer faithfulness, and rubric validity. Using ARES, we construct 100K rubric-annotated instances across ten domains. Experiments on seven benchmarks show that rubric-based RL trained with ARES outperforms continual pretraining, supervised fine-tuning, and binary-reward RL, with the largest gains on multi-dimensional open-ended tasks such as healthcare and instruction following.
- 中文摘要
基于评分标准的奖励为大型语言模型扩展强化学习(RL)提供了一种有前景的方式,超越了具有自动验证答案的任务。然而,基于评分标准的强化学习仍然具有挑战性:现有方法通常依赖专家编写的评分标准和手动构建的题组,而固定的任务级评分标准可能无法捕捉单个问题的评估需求。我们提出了ARES(自动化评分标准合成系统用于可扩展强化学习)框架,用于大规模自动构建基于评分标准的强化学习数据。从原始的预训练文档开始,ARES 将源知识转换为自包含的问题-答案对,并共同生成针对问题的加权评分标准,实现对开放式回答的实例级奖励监督。为提升多样性和质量,ARES 对生成领域标签和人物信息进行条件,并应用验证过滤器以实现问题自包含性、答案忠实性和评分标准效度。利用ARES,我们构建了跨十个领域、10万个带评分标准注释的实例。七个基准测试的实验显示,基于评分标准的强化学习(RL)在使用ARES训练后,优于持续预训练、监督微调和二元奖励强化学习,在医疗和指令跟随等多维开放式任务上获得最大收益。
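As a rough illustration of how a question-specific weighted rubric could be turned into a scalar reward, the sketch below sums the weights of the satisfied criteria and normalizes by the total weight. The normalization, the function name `rubric_reward`, and the example criteria are all assumptions; the ARES abstract does not specify the aggregation rule or how criteria are judged.

```python
def rubric_reward(rubric, satisfied):
    """Return the weighted fraction of rubric criteria that a response satisfies.

    rubric    : list of (criterion_name, weight) pairs for one question
    satisfied : set of criterion names judged as met (the judge is out of scope here)
    """
    total = sum(weight for _, weight in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(weight for name, weight in rubric if name in satisfied)
    return earned / total

# Hypothetical rubric for a single open-ended healthcare question.
example_rubric = [
    ("states_dosage_range", 3.0),
    ("mentions_contraindications", 2.0),
    ("uses_plain_language", 1.0),
]
example_reward = rubric_reward(example_rubric, {"states_dosage_range", "uses_plain_language"})
```

A binary-reward baseline corresponds to a rubric with a single criterion, which is one way to see why weighted rubrics give denser feedback on multi-dimensional tasks.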
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
CoSPlay:基于自生成代码与单元测试的测试时协作式自博弈
- Authors: Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2605.23491
- Pdf link: https://arxiv.org/pdf/2605.23491
- Abstract
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
- 中文摘要
近年来,带可验证奖励的强化学习(RLVR)和测试时扩展(TTS)通过可执行验证推动了LLM代码生成的进步。然而,真值单元测试(GT UT)依然是瓶颈:SOTA RLVR方法需要它们进行代价高昂的训练,而现有TTS方法没有它们则会失去竞争力。这催生了无GT的TTS,其中现有方法直接使用自生成的UT来改进和选择候选代码。然而,这些UT往往带有噪声,或者与错误代码虚假耦合,而UT质量又无法在没有可靠代码的情况下验证。因此,关键挑战是同时改进两者。为此,我们提出了CoSPlay,一个无GT、无需训练的框架,通过协作式自博弈同时改进代码和UT。它首先探索多样的解题思路,并识别其潜在的失败模式,以产生具有区分力的UT思路。然后利用代码-UT执行矩阵的双向通过计数信号,迭代地剪除或修复较弱的代码,刷新或替换不可靠的UT,使两个池共同演化。最后,当多个代码在最高通过计数上持平时,它会从最大的输出共识簇中选出最终代码,因为正确的代码在相同输入上输出一致,而错误的代码则会出现分歧。在四个具有挑战性的基准测试上的实验显示,Qwen2.5-7B-Instruct上的CoSPlay将平均BoN从22.1%提升至33.2%,UT准确率从14.6%提升至78.3%,与RLVR模型CURE-7B持平甚至超越。应用于CURE-7B时,BoN进一步提升了5.7%。CoSPlay还能泛化到多种骨干模型,并在相近的token预算下优于无GT的TTS基线,且随着预算增大收益持续增长。这些结果表明,无需任何GT数据即可实现一种可扩展的推理策略,用于有竞争力的代码生成。
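The two signals named in the CoSPlay abstract, bidirectional pass counts from the Code-UT execution matrix and the output-consensus tie-break, can be sketched as follows. This is a simplified illustration under stated assumptions: the matrix is already computed, each code's outputs on a shared set of probe inputs are available as a hashable tuple, and the pruning and repair loop is omitted. The function names are hypothetical.

```python
from collections import Counter

def pass_counts(matrix):
    """matrix[i][j] is True when candidate code i passes unit test j."""
    code_scores = [sum(row) for row in matrix]          # tests passed by each code
    ut_scores = [sum(col) for col in zip(*matrix)]      # codes that pass each test
    return code_scores, ut_scores

def select_by_consensus(outputs, code_scores):
    """Among codes tied at the top pass count, pick one from the largest output cluster.

    outputs[i] is a tuple of code i's outputs on shared probe inputs (assumed given).
    """
    best = max(code_scores)
    tied = [i for i, score in enumerate(code_scores) if score == best]
    clusters = Counter(outputs[i] for i in tied)
    winning_output, _ = clusters.most_common(1)[0]
    return next(i for i in tied if outputs[i] == winning_output)
```

A unit test with a very low column count is a candidate for refresh or replacement, while a code with a low row count is a candidate for pruning or repair; the thresholds for either decision are not given in the abstract.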
B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
B-GRTO:引导组相对工具优化用于引用分割
- Authors: Mario Markov, Stefan Maria Ailuro, Mohammad Mahdi, Luc Van Gool, Danda Pani Paudel (INSAIT, Sofia University "St. Kliment Ohridski")
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.23500
- Pdf link: https://arxiv.org/pdf/2605.23500
- Abstract
Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.
- 中文摘要
分割是计算机视觉中的一项基础任务,支撑着像素级场景的理解,并作为从自主感知到医学图像分析等应用的基石。对于复杂的指称分割,最新方法将大型视觉语言模型与切割解码器结合:前者分析图像和提示词,后者预测目标掩码。尽管强化学习改进了推理密集型视觉语言系统,但可训练工具如分割解码器通常分别优化并设定可微目标,且将这些目标原则性地整合进强化学习的过程仍未被充分探索。因此,我们引入了群相对工具优化(GRTO),这是一个基于数学基础的框架,用于联合优化具有可微分工具使用的策略。GRTO 重用了组相对策略优化(GRPO)推广来优化辅助工具目标,使解码器梯度与策略奖励相辅相成。此外,我们推导出了Bootstrapped-GRTO(B-GRTO),这是一种预训练方法,低成本地引导工具,从而实现更快的收敛和卓越的性能。在三种具有挑战性的引用分割设置中,B-GRTO相较普通GRPO取得了显著提升,甚至超越了领域特定的最先进方法。这展示了将强化学习与可微辅助目标统一的价值,用于推理密集型分割。
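For readers unfamiliar with GRPO, whose rollouts the B-GRTO abstract says GRTO reuses, the sketch below shows the standard group-relative advantage normalization only. It is not the GRTO tool objective, which the abstract does not specify; `group_relative_advantages` is an illustrative name.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's rollout group to zero mean and unit scale."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # eps guards against a zero-variance group
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no policy gradient, which is one motivation for adding a differentiable auxiliary signal.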
Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models
Precise:面向流匹配模型强化学习后训练的SDE一致随机采样
- Authors: Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2605.23522
- Pdf link: https://arxiv.org/pdf/2605.23522
- Abstract
Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.
- 中文摘要
强化学习(RL)已成为提升扩散和流匹配生成器中提示对齐和感知质量的有效方法。将在线强化学习应用于流匹配的关键一步是将确定性采样轨迹转化为随机策略,通常通过用随机微分方程(SDE)替代逆时间常微分方程(ODE)来实现。因此,控制探索行为和去噪动态的随机采样器是策略的一部分,其设计会显著影响奖励优化性能。我们将采样器设计拆分为两个相互依赖的部分:选择合适的随机探索量,以及在强化学习所用的小步数下忠实地离散化所得的SDE。针对第一个部分,我们分析了去噪过程中探索与稳定性之间的固有矛盾,并推导出一个平衡两者的SDE调度。针对离散化挑战,我们用一个玩具示例说明现有采样器可能偏离流匹配过程,要么引入过多的离散化噪声,要么依赖不保证收敛到数据分布的启发式规则。为解决这些问题,我们提出了 Precise 这一新型随机采样器,能够在有效探索与稳定性之间取得平衡。关键在于,Precise 通过一种冻结干净潜变量后验均值的新颖近似保持了去噪轨迹的 SDE 一致性,解决了标准采样器中的过量噪声问题。大量实验表明,这种形式使基于强化学习的奖励优化显著更快、更稳定,取得了最先进的对齐分数(如PickScore、HPSv2.1),同时在达到以往采样器最佳域内性能时所需的墙钟训练时间减少了13.1%-53.2%。
Goal-Conditioned Agents that Learn Everything All at Once
目标条件化的代理,一次性学会一切
- Authors: Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23551
- Pdf link: https://arxiv.org/pdf/2605.23551
- Abstract
A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information; however, it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.
- 中文摘要
一个目标条件化强化学习代理在探索环境中时,会看到整个过程中大量的信息,而在仅针对指令目标进行策略更新时,这些信息大多会被丢弃。全目标学习中,每个转换都用于针对每个目标的非策略学习,使智能体能够提取最大信息,但当通过朴素重新标记实现时,通常在计算上不可行。通过同时输出每个目标的数值和动作,实现高效、并行的所有目标更新,通过网络一次传递,这一过程我们称之为“一次性学习一切”(LEO)。我们证明,该方法在目标条件化Craftax上显著优于其他方法,并且在连续控制环境中与现有基线方法具有竞争力,同时相比全目标重新标记实现了>250倍的加速。我们进一步展示,如果将LEO作为教师网络而非直接参与者,这种方法可以更加强大。我们希望通过大规模解锁全目标学习,LEO能成为强化学习者在复杂环境中的有用工具。我们开源我们的代码。
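The idea of updating every goal from a single transition can be illustrated with a tabular stand-in. The sketch below is an assumption-laden toy, not the LEO architecture: it replaces the network that outputs values for all goals with a Q-table of shape `(goals, states, actions)` and uses a sparse reached-the-goal reward. One vectorized step plays the role of the single forward pass.

```python
import numpy as np

def all_goals_update(q, s, a, s_next, goal_states, lr=0.5, gamma=0.9):
    """Apply one off-policy Q-learning step for every goal at once.

    q           : array of shape (G, S, A), one Q-table per goal (tabular stand-in)
    goal_states : array of shape (G,), the state index that counts as reaching each goal
    """
    reached = (goal_states == s_next).astype(float)     # sparse reward, also marks termination
    bootstrap = q[:, s_next, :].max(axis=1)             # greedy value per goal at s_next
    target = reached + gamma * (1.0 - reached) * bootstrap
    q[:, s, a] += lr * (target - q[:, s, a])            # vectorized over all goals
    return q
```

Naive relabelling would instead loop over goals and run one update each, which is the cost the abstract reports avoiding.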
SafeSABR: Risk-Calibrated Adaptive Bitrate Streaming over Starlink Networks
SafeSABR:通过Starlink网络的风险校准自适应码率流媒体
- Authors: Hongjun Xie, Jiahang Zhu, Zhiming Shao, Chao Fan, Zenghui Zhang, Genke Yang, Pengcheng Luo
- Subjects: Systems and Control (eess.SY); Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2605.23560
- Pdf link: https://arxiv.org/pdf/2605.23560
- Abstract
Starlink, as a representative low Earth orbit (LEO) satellite broadband system, makes high-bitrate video streaming possible in regions where terrestrial broadband is unavailable. However, its access links exhibit rapid throughput fluctuations caused by satellite mobility and handovers. Existing learned adaptive bitrate (ABR) algorithms can achieve high average quality of experience (QoE), yet high-bitrate Starlink streaming exposes severe session-level rebuffering that is not captured by average QoE alone. To address this, this paper proposes SafeSABR, a risk-calibrated learned ABR framework for Starlink networks. SafeSABR formulates Starlink ABR as a tradeoff between QoE and severe risk and follows a three-stage design: behavior-cloning pretraining learns a high-QoE ABR prior, risk-calibrated reinforcement learning (RL) fine-tuning reduces severe-tail action tendencies, and a runtime safety auditor uses safe-capacity lower bounds to check policy-requested bitrates before execution. Experiments on real Starlink traces compare SafeSABR with online, prediction-assisted, and learned ABR baselines. Compared with advanced methods, SafeSABR reduces severe-stall sessions from 22.8% to 7.2% and worst-5% session rebuffering from 54.30 s to 22.68 s, with a 1.8% QoE cost. Component analyses further show that risk-calibrated fine-tuning and safe-capacity auditing reduce unsafe bitrate decisions and downstream severe-session rebuffering. These results show that combining risk-calibrated policy learning with decision-aware safe throughput forecasting can move learned ABR toward a safer operating point on the tradeoff between QoE and severe risk under volatile Starlink networks.
- 中文摘要
Starlink作为一种代表性的近地轨道(LEO)卫星宽带系统,使得在地面宽带无法覆盖的地区实现高比特率视频流成为可能。然而,其接入链路因卫星移动和切换而出现吞吐量的快速波动。现有的学习型自适应比特率(ABR)算法能够实现较高的平均体验质量(QoE),但高比特率的Starlink流媒体暴露出严重的会话级重缓冲,而这是仅靠平均QoE无法反映的。为此,本文提出了SafeSABR,一个面向Starlink网络的风险校准学习型ABR框架。SafeSABR将Starlink ABR表述为QoE与严重风险之间的权衡,并遵循三阶段设计:行为克隆预训练学习高QoE的ABR先验,风险校准的强化学习(RL)微调降低严重尾部动作倾向,运行时安全审计器在执行前使用安全容量下界检查策略请求的比特率。在真实Starlink轨迹上的实验将SafeSABR与在线、预测辅助和学习型ABR基线进行了比较。与先进方法相比,SafeSABR将严重卡顿会话的比例从22.8%降至7.2%,最差5%会话的重缓冲时间从54.30秒降至22.68秒,QoE代价为1.8%。组件分析进一步表明,风险校准微调和安全容量审计能减少不安全的比特率决策以及随之而来的严重会话重缓冲。这些结果表明,将风险校准的策略学习与决策感知的安全吞吐量预测相结合,可以使学习型ABR在波动的Starlink网络下趋向更安全的QoE与严重风险工作点。
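The runtime safety auditor described in the SafeSABR abstract checks a policy-requested bitrate against a safe-capacity lower bound before execution. A minimal sketch of such a check is shown below; the fallback rule (step down to the highest ladder entry under the bound, else the lowest rung) and the name `audit_bitrate` are assumptions, since the abstract does not say what happens when a request fails the audit.

```python
def audit_bitrate(requested_kbps, ladder_kbps, safe_capacity_kbps):
    """Return the bitrate to execute after checking the policy's request.

    ladder_kbps        : available encoding bitrates (hypothetical ladder)
    safe_capacity_kbps : conservative lower bound on near-term throughput (assumed given)
    """
    if requested_kbps <= safe_capacity_kbps:
        return requested_kbps                       # request passes the audit unchanged
    feasible = [b for b in ladder_kbps if b <= safe_capacity_kbps]
    # Assumed fallback: highest safe rung, or the lowest rung if nothing is safe.
    return max(feasible) if feasible else min(ladder_kbps)
```

Because the auditor only ever lowers a request, it can reduce tail rebuffering at some cost in average QoE, which matches the direction of the tradeoff reported in the abstract.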
ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning
ARMS:稀疏奖励多智能体强化学习中的自动奖励塑造
- Authors: Elie Abboud, Oren Gal
- Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23562
- Pdf link: https://arxiv.org/pdf/2605.23562
- Abstract
Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy-reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.
- 中文摘要
稀疏奖励是多智能体强化学习(MARL)中的一个主要瓶颈,同时学习会导致非平稳性,使得奖励设计变得特别敏感。奖励塑造可以加速学习,但在多智能体环境中,它必须保持问题的战略结构,而不仅仅是改善短期优化。我们提出了多智能体系统中的自动奖励塑造(ARMS),这是一种自监督的MARL奖励塑造框架,通过轨迹排名从稀疏的环境奖励中学习密集的塑造信号。由于单一代理轨迹排名保证不直接转移到MARL,我们通过条件最佳反应推理重新表述策略不变性,并证明如果某些条件成立,则通过塑造奖励保持每个代理在固定对方策略下的最佳反应集,从而保持纳什均衡集合。基于这一视角,ARMS在策略学习和奖励学习之间交替进行,同时在不同代理之间共享塑造参数以提高效率。在部分可观测的多智能体路径寻找领域中的实验表明,ARMS在奖励稀疏度和智能体数量增加下提升采样效率,推广到未见环境,并揭示了一种MARL特有的失败模式,在有限的探索和耦合的策略-奖励动态中,诱导振荡行为。增加探索可以减轻这一影响,稳定学习。据我们所知,ARMS是第一个基于博弈论均衡保持结果设计的MARL自动奖励塑造框架。
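The ARMS abstract says shaping signals are learned through trajectory ranking but does not state the loss. As a hedged illustration, the sketch below uses a Bradley-Terry style pairwise loss, a common choice in trajectory-ranking reward learning and a swapped-in technique here, not a confirmed detail of ARMS. The trajectory with the higher sparse return should receive the higher summed shaping reward.

```python
import math

def ranking_loss(shaping_better, shaping_worse):
    """Pairwise loss that is small when the better trajectory has the larger shaped return.

    shaping_better / shaping_worse : per-step shaping rewards predicted for two
    trajectories, where the first one had the higher sparse environment return.
    """
    margin = sum(shaping_better) - sum(shaping_worse)
    # softplus(-margin) = -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Minimizing this loss over many ranked pairs pushes the learned shaping function to agree with the ordering induced by the sparse rewards; whether such a function also preserves best-response sets depends on the conditions the paper analyzes.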
Understanding Goal Generalisation in Sequential Reinforcement Learning
理解顺序强化学习中的目标泛化
- Authors: Jason Ross Brown, Edward James Young
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23565
- Pdf link: https://arxiv.org/pdf/2605.23565
- Abstract
Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.
- 中文摘要
强化学习智能体常在训练分布之外表现出非预期的目标导向行为,但我们目前缺乏有原则的理解来说明这些智能体将如何基于训练历史泛化到新环境。我们针对在一个或多个任务上顺序训练的智能体填补了这一空白。我们研究了100多个顺序训练流程,并在250多个分布外环境中评估其行为。我们发现显著特征驱动泛化,并且训练早期学到的目标可以持续存在,并影响后续习得的目标。为解释这些现象,我们引入了潜在策略梯度,这是一种预测训练流程可能引发何种分布外行为的方法。我们的方法基于一个描述潜变量如何映射到行为的简单模型,按照能在训练目标上取得高回报的方向,模拟训练过程中低维潜变量的演变。它实现了较强的预测准确性,能够泛化到未见过的训练流程类型,并且具有可解释性。我们的发现表明,虽然分布外的强化学习智能体行为依赖于整个训练流程,但这种依赖性具有我们可以刻画的潜在结构,为从发展视角理解目标泛化奠定了基础。
Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin
更少努力,更短的证明:Tamarin 安全协议分析的强化学习
- Authors: Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem, Niklas Medinger
- Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.23643
- Pdf link: https://arxiv.org/pdf/2605.23643
- Abstract
Tools like Tamarin and ProVerif have achieved notable success in analyzing and verifying complex real-world protocols such as EMV, 5G, and WPA2, even detecting zero-day exploits. Despite these successes, verifying such protocols remains a time-consuming, challenging task, often requiring significant human effort and expertise. In this paper, we present a reinforcement learning (RL) framework inspired by AlphaZero and AlphaProof that implements a new style of proof search for Tamarin. We have developed a stateless API for Tamarin that acts as a classical RL environment. We guide a Monte Carlo Tree Search (MCTS) by a neural heuristic that learns from completed subproofs. We evaluate our framework on 16 case studies, ranging from classical protocol models to challenging state-of-the-art protocol models from recent publications. Our method finds more proofs automatically than Tamarin's standard search and produces shorter proofs than both the standard and human-engineered heuristics. Our pipeline is applicable out of the box to assist Tamarin users in active research, reducing the human effort required. Moreover, our standardized interface provides a programmatic way for users to interact with Tamarin. Finally, our work demonstrates the promising potential of adapting RL-based methods to the Tamarin domain.
- 中文摘要
像Tamarin和ProVerif这样的工具在分析和验证复杂的现实世界协议(如EMV、5G和WPA2)方面取得了显著成功,甚至能检测出零日漏洞。尽管取得了这些成功,验证此类协议仍是一项耗时且充满挑战的任务,常常需要大量人力和专业知识。本文提出了一个受AlphaZero和AlphaProof启发的强化学习(RL)框架,为Tamarin实现了一种新的证明搜索方式。我们为Tamarin开发了一个无状态API,可作为经典的强化学习环境。我们用一个从已完成的子证明中学习的神经启发式来引导蒙特卡洛树搜索(MCTS)。我们在16个案例研究上评估了我们的框架,涵盖了从经典协议模型到近期论文中具有挑战性的最先进协议模型。我们的方法自动找到的证明比 Tamarin 的标准搜索更多,并且产生的证明比标准启发式和人工设计的启发式都更短。我们的流程开箱即用,可协助Tamarin用户开展正在进行的研究,减少所需的人力。此外,我们的标准化接口为用户提供了以编程方式与Tamarin交互的途径。最后,我们的工作展示了将基于强化学习的方法应用于Tamarin领域的潜力。
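The abstract describes an MCTS guided by a neural heuristic in the style of AlphaZero but does not give the selection rule. The sketch below shows the PUCT rule commonly used in AlphaZero-style search, offered only as an assumption about what such guidance might look like. The dictionary layout and `puct_select` are hypothetical and unrelated to the actual Tamarin API.

```python
import math

def puct_select(children, c_puct=1.5):
    """Return the index of the child proof step with the highest PUCT score.

    children : list of dicts with keys "prior" (heuristic probability),
               "visits" (visit count), and "value_sum" (sum of backed-up values).
    """
    parent_visits = sum(child["visits"] for child in children)

    def score(child):
        mean_value = child["value_sum"] / child["visits"] if child["visits"] else 0.0
        exploration = c_puct * child["prior"] * math.sqrt(parent_visits) / (1 + child["visits"])
        return mean_value + exploration

    return max(range(len(children)), key=lambda i: score(children[i]))
```

The prior term lets a learned heuristic steer the search toward promising proof steps early, while the visit-count denominator keeps unvisited alternatives from being ignored.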
One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
One Policy, Infinite NPC: 可扩展游戏代理的Persona可追踪共享强化学习策略
- Authors: Yoosung Hong
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23652
- Pdf link: https://arxiv.org/pdf/2605.23652
- Abstract
On a 300-persona life-simulation benchmark, PCSP achieves compositional zero-shot persona identification up to 17x above chance, a Spearman rho of approximately 0.73 for semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce PCSP (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. PCSP combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load-bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.
- 中文摘要
在300人角色的生活模拟基准测试中,pcsp实现的组合零样本人物识别速度高出概率的17倍,Spearman rho约为0.73语义-行为对齐,推断速度是LLM作为策略基线的22倍。生活模拟游戏需要数百到数千个非玩家角色(NPC),这些角色具有鲜明的个性,同时通过设计师编写的自然语言保持可控性。现有方法在诸如人物一致性、可控性或实时推理等约束条件下失效。我们引入了pcsp(Persona Conditioned Shared Policy),这是一种单一强化学习策略,基于对自由形式人物描述的冻结LLM嵌入。PCSP 结合了每个 NPC 角色编码一次、低阶角色投影、神经角色条件反射以及 PPO + InfoNCE 一致性 + KL 多样性训练目标。在三种实验环境中,消融显示InfoNCE轨迹一致性目标是负荷的:移除它会使零次人格识别归于偶然性。对 Melting Pot 2.4.0 基底的外部验证证实,我们的方法在多智能体战略环境中产生人格条件反射的行为分歧。我们区分了两种保留评价的含义:组成零选和词汇扩展保留。最后,UE5部署在64个代理上以低失败率重现了引擎内的人格条件消融,显示子帧推理配置文件在商业游戏引擎中依然存在。这些结果证明共享强化学习策略可以支持可扩展的实时、人格条件NPC控制。
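To make the InfoNCE consistency term named in the abstract above more concrete, here is a minimal sketch of the standard InfoNCE loss. How the paper pairs trajectories with personas, and the temperature it uses, are not stated in the abstract; the pairing, the toy vectors, and `temperature=0.1` are assumptions.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] against all other positives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy unit vectors standing in for trajectory and persona embeddings.
trajectories = [[1.0, 0.0], [0.0, 1.0]]
personas_aligned = [[1.0, 0.0], [0.0, 1.0]]
personas_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(trajectories, personas_aligned)
loss_swapped = info_nce(trajectories, personas_swapped)
```

The loss is near zero when each trajectory embedding points at its own persona and large when the pairing is swapped, which is the pressure that would keep behavior identifiable by persona.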
OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations
OnePred:通过递归意图记忆实现多回合对话中的下一查询预测
- Authors: Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang
- Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.23668
- Pdf link: https://arxiv.org/pdf/2605.23668
- Abstract
Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency-quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22x compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at this https URL.
- 中文摘要
尽管大型语言模型(LLM)对话系统每天处理数百万次多回合对话,但它们本质上仍是被动的:仅在用户输入查询后才响应。迈向主动互动的关键一步是下一查询预测,即仅根据前一对话预测用户的下一次查询。这一任务的进展受限于缺乏专门的基准和根本的效率-质量权衡:天真地连接完整对话历史会引发token消耗线性增长,而仅限于最新回合则丢弃了关键的跨回合上下文。我们的关键见解是,准确的预测不需要重新阅读原始历史;只需跟踪用户在不同主题、未解决需求和兴趣转变中意图的演变轨迹即可。我们提出OnePred,它以递归更新的内存作为唯一的跨回合上下文,且对每回合成本的限制与对话时长无关。我们通过两阶段强化学习流程训练模型,先教预测什么,然后压缩什么,将记忆塑造成以预测为导向的意图链。为了建立一个严格的测试平台,我们引入了NQP-Bench,涵盖三个不同的子集。实验表明,与全历史输入相比,OnePred每回合token消耗可减少多达22倍,同时预测质量持续超过所有基线,且在较长对话中获得更大收益。我们的代码在此 https URL 公开。
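The cost argument in the OnePred abstract above (full history grows linearly, a recursive memory stays bounded) can be illustrated with a toy loop. The `update_memory` function is a hypothetical stand-in: OnePred learns its compression with reinforcement learning, whereas this sketch simply truncates, and the whitespace tokenizer and the 12-token budget are assumptions.

```python
def count_tokens(text):
    """Whitespace token count; a stand-in for a real tokenizer."""
    return len(text.split())

def update_memory(memory, query, budget=12):
    """Hypothetical compressor: keep only the most recent `budget` tokens.

    Truncation is used here only to make the size bound visible;
    it is not the learned compression described in the abstract.
    """
    tokens = (memory + " " + query).split()
    return " ".join(tokens[-budget:])

queries = [
    "best hiking trails near Seattle",
    "which of those are dog friendly",
    "what gear do I need for a day hike",
    "any rain expected this weekend",
    "recommend waterproof boots under 150 dollars",
]

memory, history = "", ""
memory_cost, history_cost = [], []
for q in queries:
    # Track the size of the context each strategy would feed to the predictor.
    history = (history + " " + q).strip()
    memory = update_memory(memory, q)
    history_cost.append(count_tokens(history))
    memory_cost.append(count_tokens(memory))
```

After five turns the concatenated history is still growing, while the memory has settled at its fixed budget regardless of how long the conversation runs.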
SeedER: Seed-and-Expand Retrieval from Knowledge Graphs
SeedER:从知识图谱中进行种子与扩展检索
- Authors: Hamed Shirzad, Frederik Wenkel, Dominique Beaini, Danica J. Sutherland, Emmanuel Noutahi
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.23753
- Pdf link: https://arxiv.org/pdf/2605.23753
- Abstract
Knowledge graphs (KGs) offer a rich representation for relational knowledge, but their irregular structure makes retrieval challenging: ego-graph expansion grows rapidly, and dense embedding methods struggle with multi-hop compositional queries. Existing agent-based graph exploration approaches, while expressive, are often too expensive for large-scale retrieval. We introduce SeedER (Seed-and-Expand Retrieval), a retrieval framework that explicitly leverages KG structure through iterative, low-cost expansion. SeedER first seeds a compact set of core nodes using lightweight dense and entity-based retrieval, then selectively expands this set via a learned graph-aware policy trained with reinforcement learning. This design decomposes global reasoning into reusable local decisions, enabling efficient discovery of query-relevant nodes while tightly controlling expansion cost. We show theoretical limitations of dense retrieval on compositional graph queries, and establish advantages of SeedER from both compositional generalization and graph-constrained submodular optimization perspectives. Empirically, SeedER substantially improves recall with compact candidate sets over strong dense and graph-augmented baselines, making it an effective first-stage retriever for knowledge-intensive reasoning systems.
- 中文摘要
知识图(KGs)为关系知识提供了丰富的表示,但其不规则结构使得检索变得困难:自我图扩展速度迅速,密集嵌入方法在多跳组合查询时遇到困难。现有基于主体的图探索方法虽然表达力强,但通常成本过高,不适合大规模检索。我们引入了SeedER(种子与扩展检索),这是一个通过迭代低成本扩展明确利用KG结构的检索框架。SeedER首先通过轻量级、密集和基于实体的检索技术为一组紧凑的核心节点做种,然后通过通过强化学习训练的图感知策略选择性地扩展该集。该设计将全局推理分解为可重用的局部决策,使得高效发现与查询相关的节点,同时严格控制扩展成本。我们展示了合成图查询中密集检索的理论局限性,并从组合推广和图约束子模块优化视角阐明了SeedER的优势。从经验上看,SeedER通过紧凑候选集在强密集和图增基线上显著提升了召回率,使其成为知识密集型推理系统的有效第一阶段检索工具。
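The seed-then-expand loop described in the SeedER abstract above might look roughly like the following. The fixed `relevance` lookup stands in for the learned graph-aware policy, and the toy graph, the threshold, and the per-round limit are all assumptions rather than details from the paper.

```python
def seed_and_expand(graph, seeds, score, max_rounds=2, per_round=2, threshold=0.5):
    """Grow a candidate set from seed nodes, adding a few high-scoring neighbors per round.

    `score(node)` stands in for the learned policy; the caps on rounds and on
    nodes per round are what keep the expansion cost bounded.
    """
    selected = list(seeds)
    frontier = list(seeds)
    for _ in range(max_rounds):
        # Only neighbors of the current frontier are considered (local decisions).
        candidates = {
            nbr for node in frontier for nbr in graph.get(node, [])
            if nbr not in selected
        }
        ranked = sorted(candidates, key=lambda n: (-score(n), n))
        frontier = [n for n in ranked if score(n) >= threshold][:per_round]
        if not frontier:
            break
        selected.extend(frontier)
    return selected

# Hypothetical toy knowledge graph and relevance scores.
graph = {
    "aspirin": ["COX-1", "COX-2", "headache"],
    "COX-2": ["inflammation", "celecoxib"],
    "COX-1": ["stomach lining"],
    "headache": [],
}
relevance = {"COX-2": 0.9, "COX-1": 0.6, "headache": 0.2,
             "inflammation": 0.8, "celecoxib": 0.7, "stomach lining": 0.3}

result = seed_and_expand(graph, ["aspirin"], lambda n: relevance.get(n, 0.0))
```

With these settings the candidate set can never exceed the seeds plus `max_rounds * per_round` nodes, in contrast to an unconstrained ego-graph expansion.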
Robotic Strawberry Harvesting with Robust Vision and Deep Reinforcement Learning based Sim-to-Real Control
基于模拟到真实控制的机器人草莓采摘与强大视觉和深度强化学习
- Authors: Al Bashir, Shao-Yang Chang, Partho Ghose, Prem Raj, Chen-Kang Huang, Azlan Zahid
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.23863
- Pdf link: https://arxiv.org/pdf/2605.23863
- Abstract
This study presents a closed-loop robotic strawberry harvesting system that combines a robust vision module, simulation-trained deep reinforcement learning (DRL) control, and ROS-based real-robot execution. For perception, we propose HRAttnEdge-YOLO26-seg, a modified YOLO26-seg architecture that incorporates a high-resolution P2 branch, segmentation-path attention, and edge-supervised prototype learning to improve instance segmentation in cluttered scenes. For control, we train a target-conditioned Proximal Policy Optimization (PPO) policy in Isaac Lab to produce smooth joint-position commands for a UR10e manipulator and deploy it on a UR10e robot for target-fruit reaching and harvesting. This simulation-based approach reduces hardware dependency, lowers development cost, and allows scalable policy training without exhaustive physical trials before real deployment. The proposed vision model demonstrated the highest overall performance among the evaluated methods. On both self-collected and public datasets, the model showed a 10 to 14% improvement in segmentation performance. In controlled in-house tests, the PPO controller produced stable and dynamically smoother motion than an inverse kinematics (IK)-based MoveIt baseline. In greenhouse trials, the proposed integrated system harvested 281 strawberries, achieving 96.6% reaching success, 91.3% grasp-and-pull success, and 84.3% overall harvesting success. These results illustrate that task-specific perception combined with simulation-trained PPO can serve as a practical and resource-efficient alternative to conventional planner-dependent reaching in manipulation, enabling reliable closed-loop robotic harvesting in complex agricultural environments.
- 中文摘要
本研究提出了一套闭环机器人草莓采摘系统,结合了稳健的视觉模块、仿真训练的深度强化学习(DRL)控制和基于ROS的真实机器人执行。在感知方面,我们提出了HRAttnEdge-YOLO26-seg,这是一种改进型YOLO26-seg架构,结合了高分辨率P2分支、分割路径注意力和边缘监督原型学习,以改善杂乱场景中的实例分割。在控制方面,我们在Isaac实验室训练了目标条件近端策略优化(PPO),为UR10e机械臂生成平滑的联合位置指令,并将其部署到UR10e机器人上,实现目标果采集和采摘。这种基于仿真的方法减少了对硬件的依赖,降低了开发成本,并允许在实际部署前无需经过详尽物理试验即可进行可扩展的策略训练。所提出的视觉模型在评估方法中表现最佳。在自收集和公开数据集中,模型显示分段性能提升了10%到14%。在受控的内部测试中,PPO控制器产生的运动比基于逆运动学(IK)的MoveIt基线更稳定且动态更平滑。在温室试验中,拟议的综合系统收获了281个草莓,成功率为96.6%,抓取拉取成功率为91.3%,整体收获成功率为84.3%。这些结果表明,任务专属感知与仿真训练的PPO结合,可以作为传统依赖规划者操作的实用且资源高效的替代方案,实现复杂农业环境中可靠的闭环机器人收割。
Geo-Align: Video Generation Alignment via Metric Geometry Reward
地理对齐:通过度量几何奖励实现视频生成对齐
- Authors: Zizun Li, Haoyu Guo, Runzhe Teng, Chunhua Shen, Tong He
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2605.23903
- Pdf link: https://arxiv.org/pdf/2605.23903
- Abstract
Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first Reinforcement Learning framework specifically designed for camera-controlled video re-rendering. Built upon a pretrained model, we optimize the model through a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from generated videos, explicitly penalizing deviations in rotation and translation. Furthermore, we meticulously designed a data pipeline strategy based on real-world conditioning videos and target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both precise camera controllability and visual fidelity, indicating the effectiveness of our method.
- 中文摘要
近年来,摄像机控制的视频生成取得了显著进步。然而,现有的视频到视频重渲染方法主要依赖于使用合成数据集进行监督微调。目前,实时多视角的实时视频数据极为稀缺。因此,主流范式在处理非发行的真实视频时往往呈现有限的泛化,模型难以准确遵循物理尺度和摄像机轨迹。为弥合这一差距,我们提出了Geo-Align,这是首个专为摄像头控制视频重新渲染设计的强化学习框架。基于预训练模型,我们通过规模感知奖励机制优化模型。具体来说,我们引入了一种度量三维估计器,从生成的视频中提取精确的相机轨迹,明确惩罚旋转和平移的偏差。此外,我们基于真实世界的条件视频和基于合成数据的目标摄像头轨迹,精心设计了数据流水线策略,消除了对配对数据的依赖。大量实验表明,Geo-Align在精确的摄像头控制性和视觉真实度上始终优于现有的监督学习基线,显示了我们方法的有效性。
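One way to picture the trajectory penalty in the Geo-Align abstract above is a reward that compares estimated camera poses to target poses frame by frame. The abstract only says that deviations in rotation and translation are penalized; the geodesic rotation angle, the Euclidean translation error, the additive form, and the weights below are assumptions, not the paper's reward.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_angle(r_a, r_b):
    """Geodesic angle (radians) between two rotation matrices."""
    # trace(r_a^T r_b) equals the sum of elementwise products.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)

def trajectory_reward(estimated, target, w_rot=1.0, w_trans=1.0):
    """Negative mean pose error over frames; values closer to 0 are better.

    Each pose is (rotation_matrix, translation_vector).
    """
    total = 0.0
    for (r_est, t_est), (r_tgt, t_tgt) in zip(estimated, target):
        rot_err = rotation_angle(r_est, r_tgt)
        trans_err = math.dist(t_est, t_tgt)  # Euclidean distance in metric units
        total += w_rot * rot_err + w_trans * trans_err
    return -total / len(target)

target = [(rotation_z(0.0), [0.0, 0.0, 0.0]), (rotation_z(0.1), [0.5, 0.0, 0.0])]
drifted = [(rotation_z(0.0), [0.0, 0.0, 0.0]), (rotation_z(0.3), [0.5, 0.2, 0.0])]

perfect_reward = trajectory_reward(target, target)
drifted_reward = trajectory_reward(drifted, target)
```

Because the translation error is measured in metric units, a scale-ambiguous pose estimate would not be enough here, which is presumably why the abstract stresses a metric 3D estimator.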
Keyword: diffusion policy
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control
SCRIPT:可扩展扩散政策,支持多阶段训练,用于语言驱动的物理类人控制
- Authors: Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu
- Subjects:
Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.22894
- Pdf link: https://arxiv.org/pdf/2605.22894
- Abstract
Controlling physics-based humanoids from natural-language instructions is a critical step toward general-purpose embodied agents. However, existing methods remain constrained by a tension between semantic expressiveness and physical feasibility, often failing to jointly achieve faithful instruction following, high-quality motion, and stable long-horizon control. We propose SCRIPT, a scalable diffusion policy with a multi-stage training framework for language-driven physics-based humanoid control. The core of SCRIPT is a Joint Action-State-Text Diffusion Transformer (JAST-DiT), which represents actions, physical states, and text as dedicated token streams and couples them through joint attention, enabling direct interaction between language semantics and control dynamics. To stabilize autoregressive control, we introduce a nonlinear history conditioning mechanism, which preserves the dense recent context and samples increasingly sparse cues from long-term history. Beyond supervised imitation pre-training, we propose a post-training stage, further improving the performance using Reinforcement Learning with Hybrid Rewards (RLHR). By injecting learnable noise into the flow-sampling process, RLHR effectively improves motion quality and instruction following within closed-loop simulations using hybrid physical feedback and text rewards. Quantitative evaluations demonstrate that SCRIPT outperforms prior state-of-the-art methods, with gains across text alignment, motion quality, and physical realism metrics. Furthermore, scaling studies on the 1200-hour MotionMillion dataset demonstrate consistent performance gains with model scaling, highlighting SCRIPT's robust scalability for large-scale pre-training. Our code will be publicly available for future research.
- 中文摘要
通过自然语言指令控制基于物理的人形生物,是迈向通用具身代理的关键一步。然而,现有方法仍受限于语义表达性和物理可行性的张力,常常未能共同实现忠实的跟随指令、高质量的运动和稳定的长视野控制。我们提出了SCRIPT,一种可扩展的扩散策略,采用多阶段训练框架,用于基于语言驱动的物理类人控制。SCRIPT的核心是一个联合动作-状态-文本扩散变换器(JAST-DiT),它将动作、物理状态和文本表示为专用的令牌流,并通过联合注意力将它们耦合,实现语言语义与控制动态之间的直接交互。为稳定自回归控制,我们引入了非线性历史条件机制,保留密集的近期背景,并采样越来越稀疏的长期历史线索。除了监督模仿的预训练外,我们还提出了一个训练后阶段,通过混合奖励强化学习(RLHR)进一步提升表现。通过向流采样过程注入可学习噪声,RLHR有效提升闭环模拟中的运动质量和指令跟随,采用混合物理反馈和文本奖励。定量评估显示,SCRIPT在文本对齐、运动质量和物理真实性指标方面均优于以往最先进方法。此外,基于1200小时MotionMillion数据集的缩放研究显示,模型缩放持续带来性能提升,凸显了SCRIPT在大规模预训练中的强大可扩展性。我们的代码将公开供未来研究使用。
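The history conditioning in the SCRIPT abstract above keeps recent context dense and samples older context increasingly sparsely. The abstract does not give the schedule, so the exponentially growing gaps, the window sizes, and the function name below are an assumed instance of such nonlinear sampling.

```python
def history_indices(current_step, dense_window=4, num_sparse=4, growth=2):
    """Return past step indices: a dense recent window plus exponentially spaced older steps."""
    indices = []
    # Dense part: the most recent `dense_window` steps, one by one.
    for offset in range(1, dense_window + 1):
        indices.append(current_step - offset)
    # Sparse part: the gap between sampled steps grows by a factor of `growth`.
    offset, gap = dense_window, growth
    for _ in range(num_sparse):
        offset += gap
        indices.append(current_step - offset)
        gap *= growth
    # Drop indices that fall before the start of the episode.
    return [i for i in indices if i >= 0]

selected = history_indices(current_step=100)
early = history_indices(current_step=10)
```

Eight conditioning frames cover a lookback of 34 steps at step 100, so the context length stays fixed while the reach into the past grows with the gap schedule.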
Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation
用于合成机器人操作的语义结构专家混合
- Authors: Chengyu Deng, Guanqi Chen, Yizhou Chen, Zejia Liu, Zhiwen Ruan, Guanhua Chen, Jia Pan
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.23477
- Pdf link: https://arxiv.org/pdf/2605.23477
- Abstract
Diffusion-based policies have established a new standard for precise robotic manipulation but face a critical scalability bottleneck: high-performance models are computationally expensive, while lightweight alternatives often fail to generalize across diverse multi-task environments. Mixture-of-Experts (MoE) architectures offer a promising path to efficiency by activating only a subset of parameters. However, existing MoE routing mechanisms typically rely on low-level noise or latent statistics, ignoring the compositional nature of manipulation tasks. This can fragment reusable behaviors across experts, limiting interpretability and transferability. We introduce Semantically Structured Mixture-of-Experts Diffusion Policy (SMoDP) for compositional robotic manipulation, a framework that grounds expert specialization in semantic task structure. SMoDP leverages a lightweight, inference-time skill predictor, supervised by offline annotations from Vision-Language Models (VLMs), to route action chunks to experts specialized for specific behavioral phases. To ensure robust assignment, we propose a dual contrastive alignment strategy that grounds multi-modal observations in language-defined skill semantics (Inter-modal) while enforcing routing consistency across visually distinct but functionally related behaviors (Intra-modal). Our approach outperforms representative diffusion and MoE-based baselines on multi-task benchmarks with significantly improved parameter efficiency and demonstrates effective compositional transfer to novel tasks through parameter-efficient fine-tuning. Project website: this https URL
- 中文摘要
基于扩散的策略确立了精确机器人操作的新标准,但面临一个关键的可扩展性瓶颈:高性能模型计算成本高昂,而轻量级替代方案往往无法在多任务环境中推广。专家混合架构(MoE)通过激活部分参数,提供了一条有前景的效率路径。然而,现有的MoE路由机制通常依赖于低级噪声或潜在统计,忽视了操作任务的组合性质。这可能导致可重复使用的行为分散在专家之间,限制了可解释性和可转移性。我们引入了用于合成机器人操作的语义结构化专家混合扩散策略(SMoDP),这一框架为语义任务结构领域的专家专精奠定基础。SMoDP利用轻量级推理时间技能预测器,由视觉语言模型(VLMs)的离线注释监督,将动作块路由给专门针对特定行为阶段的专家。为确保稳健分配,我们提出了一种双重对比对比策略,将多模态观察建立在语言定义的技能语义(跨模态)中,同时在视觉上不同但功能相关的行为间强制路由一致性(模态内)。我们的方法在多任务基准测试中优于代表性扩散和基于MoE的基线,参数效率显著提升,并通过参数高效的微调展示了对新颖任务的有效组合转移。项目网站:此 https URL