Generated: 2025-10-09 16:28:55 (UTC+8); arXiv release time: 2025-10-09 20:00 EDT (2025-10-10 08:00 UTC+8)
39 relevant papers today
Keyword: reinforcement learning
MCCE: A Framework for Multi-LLM Collaborative Co-Evolution
- Authors: Nian Ran, Zhongzheng Li, Yue Wang, Qingsong Ran, Xiaoyuan Zhang, Shikun Feng, Richard Allmendinger, Xiaoguang Zhao
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.06270
- Pdf link: https://arxiv.org/pdf/2510.06270
- Abstract
Multi-objective discrete optimization problems, such as molecular design, pose significant challenges due to their vast and unstructured combinatorial spaces. Traditional evolutionary algorithms often get trapped in local optima, while expert knowledge can provide crucial guidance for accelerating convergence. Large language models (LLMs) offer powerful priors and reasoning ability, making them natural optimizers when expert knowledge matters. However, closed-source LLMs, though strong in exploration, cannot update their parameters and thus cannot internalize experience. Conversely, smaller open models can be continually fine-tuned but lack broad knowledge and reasoning strength. We introduce Multi-LLM Collaborative Co-evolution (MCCE), a hybrid framework that unites a frozen closed-source LLM with a lightweight trainable model. The system maintains a trajectory memory of past search processes; the small model is progressively refined via reinforcement learning, with the two models jointly supporting and complementing each other in global exploration. Unlike model distillation, this process enhances the capabilities of both models through mutual inspiration. Experiments on multi-objective drug design benchmarks show that MCCE achieves state-of-the-art Pareto front quality and consistently outperforms baselines. These results highlight a new paradigm for enabling continual evolution in hybrid LLM systems, combining knowledge-driven exploration with experience-driven learning.
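The abstract above evaluates methods by Pareto front quality. As a minimal sketch of what a Pareto front is, assuming every objective is maximized, the snippet below filters a list of candidates down to the non-dominated ones. The scores and function names are hypothetical and are not taken from the MCCE paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective_1, objective_2) scores for four candidate molecules.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
front = pareto_front(candidates)
```

Here `(0.4, 0.4)` is dropped because `(0.5, 0.5)` is better on both objectives; the other three candidates trade one objective against the other and all remain on the front.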
General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks
- Authors: Fahim Shahriar, Cheryl Wang, Alireza Azimi, Gautham Vasan, Hany Hamed Elanwar, A. Rupam Mahmood, Colin Bellinger
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.06277
- Pdf link: https://arxiv.org/pdf/2510.06277
- Abstract
Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.
Monte Carlo Permutation Search
- Authors: Tristan Cazenave
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.06381
- Pdf link: https://arxiv.org/pdf/2510.06381
- Abstract
We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE algorithm. MCPS is relevant when deep reinforcement learning is not an option, or when the computing power available before play is not substantial, such as in General Game Playing, for example. The principle of MCPS is to include in the exploration term of a node the statistics on all the playouts that contain all the moves on the path from the root to the node. We extensively test MCPS on a variety of games: board games, wargame, investment game, video game and multi-player games. MCPS has better results than GRAVE in all the two-player games. It has equivalent results for multi-player games because these games are inherently balanced even when players have different strengths. We also show that using abstract codes for moves instead of exact codes can be beneficial to both MCPS and GRAVE, as they improve the permutation statistics and the AMAF statistics. We also provide a mathematical derivation of the formulas used for weighting the three sources of statistics. These formulas are an improvement on the GRAVE formula since they no longer use the bias hyperparameter of GRAVE. Moreover, MCPS is not sensitive to the ref hyperparameter.
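The principle stated in the abstract, gathering statistics over all playouts that contain every move on the path from the root to a node, can be illustrated with a small sketch. This is only an assumed simplification: moves are plain strings, the player who made each move is ignored, and the weighting formulas from the paper are not reproduced. The name `permutation_stats` is hypothetical.

```python
from typing import List, Sequence, Tuple

Playout = Tuple[Sequence[str], float]  # (moves played, result in [0, 1])


def permutation_stats(playouts: List[Playout], path: Sequence[str]) -> Tuple[int, float]:
    """Count and mean result of playouts containing every move on `path`, in any order."""
    needed = set(path)
    matching = [result for moves, result in playouts if needed.issubset(moves)]
    if not matching:
        return 0, 0.0
    return len(matching), sum(matching) / len(matching)


# Three toy playouts; the first two contain both "a" and "b", in different orders.
playouts = [
    (["a", "b", "c"], 1.0),
    (["b", "a", "d"], 0.0),
    (["a", "c", "d"], 1.0),
]
count, mean = permutation_stats(playouts, path=["a", "b"])
```

Because the check is order-independent, a playout that reaches the same moves through a different permutation still contributes to the node's statistics.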
Attention-Enhanced Reinforcement Learning for Dynamic Portfolio Optimization
- Authors: Pei Xue, Yuanchun Ye
- Subjects: Computational Engineering, Finance, and Science (cs.CE)
- Arxiv link: https://arxiv.org/abs/2510.06466
- Pdf link: https://arxiv.org/pdf/2510.06466
- Abstract
We develop a deep reinforcement learning framework for dynamic portfolio optimization that combines a Dirichlet policy with cross-sectional attention mechanisms. The Dirichlet formulation ensures that portfolio weights are always feasible, handles tradability constraints naturally, and provides a stable way to explore the allocation space. The model integrates per-asset temporal encoders with a global attention layer, allowing it to capture sector relationships, factor spillovers, and other cross asset dependencies. The reward function includes transaction costs and portfolio variance penalties, linking the learning objective to traditional mean variance trade offs. The results show that attention based Dirichlet policies outperform equal-weight and standard reinforcement learning benchmarks in terms of terminal wealth and Sharpe ratio, while maintaining realistic turnover and drawdown levels. Overall, the study shows that combining principled action design with attention-based representations improves both the stability and interpretability of reinforcement learning for portfolio management.
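To make the feasibility claim concrete: a Dirichlet sample is non-negative and sums to one, so it is always a valid long-only allocation. The sketch below is an illustration under assumptions, not the authors' implementation. The concentration vector stands in for a policy-network output, and `cost_rate` and `risk_aversion` are hypothetical penalty coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_weights(concentration: np.ndarray, tradable: np.ndarray) -> np.ndarray:
    """Sample portfolio weights from a Dirichlet over the tradable assets only."""
    weights = np.zeros_like(concentration, dtype=float)
    weights[tradable] = rng.dirichlet(concentration[tradable])
    return weights


def reward(weights, prev_weights, asset_returns, cov, cost_rate=0.001, risk_aversion=1.0):
    """Portfolio return minus transaction-cost and variance penalties."""
    turnover = np.abs(weights - prev_weights).sum()
    variance = weights @ cov @ weights
    return weights @ asset_returns - cost_rate * turnover - risk_aversion * variance


concentration = np.array([2.0, 1.0, 3.0, 1.5])   # hypothetical policy output
tradable = np.array([True, True, False, True])   # third asset cannot be traded
w = sample_weights(concentration, tradable)

prev = np.full(4, 0.25)                          # previous equal-weight allocation
r = reward(w, prev, np.array([0.01, -0.02, 0.03, 0.00]), np.eye(4) * 0.01)
```

Masking the untradable asset before sampling is one simple way to respect a tradability constraint while keeping the remaining weights on the simplex.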
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
- Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.06499
- Pdf link: https://arxiv.org/pdf/2510.06499
- Abstract
Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them
- Authors: Jiahe Jin, Abhijay Paladugu, Chenyan Xiong
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.06534
- Pdf link: https://arxiv.org/pdf/2510.06534
- Abstract
Agentic search leverages large language models (LLMs) to interpret complex user information needs and execute a multi-step process of planning, searching, and synthesizing information to provide answers. This paradigm introduces unique challenges for LLMs' reasoning and agentic capabilities when interacting with retrieval systems and the broader web. In this paper, we propose a reasoning-driven LLM-based pipeline to study effective reasoning behavior patterns in agentic search. Using this pipeline, we analyze successful agentic search trajectories and identify four beneficial reasoning behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Based on these findings, we propose a technique called Behavior Priming to train more effective agentic search models. It synthesizes agentic search trajectories that exhibit these four behaviors and integrates them into the agentic search model through supervised fine-tuning (SFT), followed by standard reinforcement learning (RL). Experiments on three benchmarks (GAIA, WebWalker, and HLE) demonstrate that behavior priming yields over 35% gains in Llama3.2-3B and Qwen3-1.7B compared to directly training agentic search models with RL. Crucially, we demonstrate that the desired reasoning behaviors in the SFT data, rather than the correctness of the final answer, is the critical factor for achieving strong final performance after RL: fine-tuning on trajectories with desirable reasoning behaviors but incorrect answers leads to better performance than fine-tuning on trajectories with correct answers. Our analysis further reveals the underlying mechanism: the introduced reasoning behaviors endow models with more effective exploration (higher pass@k and entropy) and test-time scaling (longer trajectories) capabilities, providing a strong foundation for RL. Our code will be released as open source.
Scalable Policy-Based RL Algorithms for POMDPs
- Authors: Ameya Anjarlekar, Rasoul Etesami, R Srikant
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.06540
- Pdf link: https://arxiv.org/pdf/2510.06540
- Abstract
The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP model into a finite-state Markov Decision Process (MDP) (called Superstate MDP). We first derive theoretical guarantees that improve upon prior work that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by Policy Optimization by treating it as an MDP, where the MDP state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
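The idea of treating a finite history as the MDP state can be sketched as a sliding window over recent (action, observation) pairs. This is an assumed, simplified reading of the Superstate construction; the class name and the choice to store action-observation pairs are hypothetical.

```python
from collections import deque
from typing import Deque, Hashable, Tuple


class HistoryState:
    """Keeps the last `k` (action, observation) pairs as a hashable surrogate state."""

    def __init__(self, k: int) -> None:
        self.window: Deque[Tuple[Hashable, Hashable]] = deque(maxlen=k)

    def update(self, action: Hashable, observation: Hashable) -> Tuple:
        # deque(maxlen=k) silently discards the oldest pair once the window is full.
        self.window.append((action, observation))
        return tuple(self.window)


tracker = HistoryState(k=2)
for action, obs in [("left", "wall"), ("up", "open"), ("up", "door")]:
    state = tracker.update(action, obs)
```

The returned tuple can be used as a key for tabular methods or mapped to features for linear function approximation. A larger `k` gives a bigger state space; per the abstract, it also gives an approximation error that shrinks exponentially in the history length.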
Incoherence in goal-conditioned autoregressive models
- Authors: Jacek Karwowski, Raymond Douglas
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.06545
- Pdf link: https://arxiv.org/pdf/2510.06545
- Abstract
We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterize the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon.
The Markovian Thinker
- Authors: Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.06557
- Pdf link: https://arxiv.org/pdf/2510.06557
- Abstract
Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
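The quadratic-versus-linear claim can be checked with back-of-the-envelope arithmetic. The sketch below counts only token-to-token attention pairs under a deliberately crude model: the prompt, the carryover text, and all non-attention compute are ignored. It is therefore not expected to reproduce the H100-month figures quoted in the abstract. Function names are hypothetical.

```python
def longcot_pairs(total_tokens: int) -> int:
    """Attention pairs when every token attends to itself and all earlier tokens."""
    return total_tokens * (total_tokens + 1) // 2


def chunked_pairs(total_tokens: int, chunk: int) -> int:
    """Same count when the context is reset every `chunk` tokens (carryover ignored)."""
    full, rest = divmod(total_tokens, chunk)
    return full * chunk * (chunk + 1) // 2 + rest * (rest + 1) // 2


# Compare a full-context trace with 8K-token chunks at two thinking lengths.
for n in (24_000, 96_000):
    ratio = longcot_pairs(n) / chunked_pairs(n, 8_000)
    print(n, round(ratio, 2))
```

Under this model the full-context count grows with the square of the thinking length, while the chunked count grows linearly in the number of chunks, so the ratio is roughly the number of chunks.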
Aligning Large Language Models via Fully Self-Synthetic Data
- Authors: Shangjian Yin, Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Yu Meng
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.06652
- Pdf link: https://arxiv.org/pdf/2510.06652
- Abstract
Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets, while Reinforcement Learning from AI Feedback (RLAIF) also incurs significant costs, requiring the collection of diverse prompts and corresponding responses, often necessitating external reward models or proprietary models like GPT-4 to annotate preference pairs. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. Specifically, SAO first instructs the LLM to engage in persona role-play and generate diverse prompts and responses, which are then self-evaluated for preference optimization. Extensive experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval 2.0, while maintaining strong performance on downstream objective tasks (e.g., question-answering, math reasoning). Our work provides a practical solution for self-improvement in aligning LLMs, and the code for reproducing our results is available at: this https URL.
PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
- Authors: Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi, Hongzhi Li, Yutao Xie
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.06670
- Pdf link: https://arxiv.org/pdf/2510.06670
- Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness depends on high-quality instruction data. Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Even with Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is unclear how much data is actually required to fine-tune a base model into a strong instruction-following model. Current approaches often rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform compared to proprietary models, creating barriers for academic and resource-limited communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, far fewer than state-of-the-art datasets like Magpie. Through evaluations by fine-tuning Llama-3-8B-Base on PiKa and other public datasets, we show that PiKa-SFT outperforms models trained on much larger data. On AlpacaEval 2.0 and Arena-Hard benchmarks, PiKa-SFT fine-tuning even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B to 7B) on PiKa-SFT, achieving consistent gains. These findings demonstrate that high-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment. Code and data: this https URL.
XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation
- Authors: Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.06672
- Pdf link: https://arxiv.org/pdf/2510.06672
- Abstract
Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and heavy reliance on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advances (e.g., GRPO and GSPO) by up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7X.
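For context on why zero-reward prompts stagnate, here is a minimal sketch of the standard group-relative advantage used by GRPO-style methods: rewards are standardized within one prompt's group of rollouts. This is the baseline computation only. XRPO's rollout allocator and novelty-aware sharpening are not reproduced, and the `eps` value is an assumption.

```python
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within one prompt's group of rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


mixed = np.array([1.0, 0.0, 0.0, 1.0])   # two correct and two incorrect rollouts
all_wrong = np.zeros(4)                  # a zero-reward prompt

adv_mixed = group_relative_advantages(mixed)
adv_wrong = group_relative_advantages(all_wrong)
```

When every rollout in a group receives the same reward, all advantages are zero and the prompt contributes no learning signal, which is the stagnation the abstract's seeding strategy targets.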
REACH: Reinforcement Learning for Adaptive Microservice Rescheduling in the Cloud-Edge Continuum
- Authors: Xu Bai, Muhammed Tawfiqul Islam, Rajkumar Buyya, Adel N. Toosi
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2510.06675
- Pdf link: https://arxiv.org/pdf/2510.06675
- Abstract
Cloud computing, despite its advantages in scalability, may not always fully satisfy the low-latency demands of emerging latency-sensitive pervasive applications. The cloud-edge continuum addresses this by integrating the responsiveness of edge resources with cloud scalability. Microservice Architecture (MSA), characterized by modular, loosely coupled services, aligns effectively with this continuum. However, heterogeneous and dynamic computing resources pose significant challenges to the optimal placement of microservices. We propose REACH, a novel rescheduling algorithm that dynamically adapts microservice placement in real time using reinforcement learning to react to fluctuating resource availability and performance variations across distributed infrastructures. Extensive experiments on a real-world testbed demonstrate that REACH reduces average end-to-end latency by 7.9%, 10%, and 8% across three benchmark MSA applications, while effectively mitigating latency fluctuations and spikes.
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
- Authors: Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.06710
- Pdf link: https://arxiv.org/pdf/2510.06710
- Abstract
Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11% across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.
Dual Goal Representations
- Authors: Seohong Park, Deepinder Mann, Sergey Levine
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.06714
- Pdf link: https://arxiv.org/pdf/2510.06714
- Abstract
In this work, we introduce dual goal representations for goal-conditioned reinforcement learning (GCRL). A dual goal representation characterizes a state by "the set of temporal distances from all other states"; in other words, it encodes a state through its relations to every other state, measured by temporal distance. This representation provides several appealing theoretical properties. First, it depends only on the intrinsic dynamics of the environment and is invariant to the original state representation. Second, it contains provably sufficient information to recover an optimal goal-reaching policy, while being able to filter out exogenous noise. Based on this concept, we develop a practical goal representation learning method that can be combined with any existing GCRL algorithm. Through diverse experiments on the OGBench task suite, we empirically show that dual goal representations consistently improve offline goal-reaching performance across 20 state- and pixel-based tasks.
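The definition quoted in the abstract, a state characterized by its temporal distances from all other states, can be illustrated on a tiny deterministic graph where temporal distance is just the fewest number of steps. This is an assumed toy setting; the paper's learned representation is not reproduced, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def bfs_distances(adjacency: Dict[int, List[int]], source: int) -> Dict[int, float]:
    """Fewest steps from `source` to every state (inf if unreachable)."""
    dist = {s: float("inf") for s in adjacency}
    dist[source] = 0
    queue = deque([source])
    while queue:
        s = queue.popleft()
        for nxt in adjacency[s]:
            if dist[nxt] == float("inf"):
                dist[nxt] = dist[s] + 1
                queue.append(nxt)
    return dist


def dual_representation(adjacency: Dict[int, List[int]], goal: int) -> List[float]:
    """Vector of temporal distances from every state to `goal`."""
    return [bfs_distances(adjacency, s)[goal] for s in sorted(adjacency)]


# Hypothetical 4-state chain with one-way edges 0 -> 1 -> 2 -> 3 and a shortcut 0 -> 2.
chain = {0: [1, 2], 1: [2], 2: [3], 3: []}
rep = dual_representation(chain, goal=3)
```

The vector depends only on the transition structure, not on how states happen to be encoded, which mirrors the invariance property claimed in the abstract.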
- 中文摘要
在这项工作中,我们引入了目标条件强化学习(GCRL)的双目标表示。双目标表示通过“与所有其他状态的时间距离集”来表征状态;换句话说,它通过状态与所有其他状态的关系来编码一个状态,以时间距离来衡量。这种表示提供了几个吸引人的理论属性。首先,它仅取决于环境的内在动力学,并且对原始状态表示不变。其次,它包含可证明的足够信息来恢复最佳目标实现策略,同时能够过滤掉外生噪声。基于这一概念,我们开发了一种实用的目标表示学习方法,可以与任何现有的GCRL算法相结合。通过对 OGBench 任务套件的各种实验,我们实证表明,双目标表示在 20 个基于状态和像素的任务中持续提高离线目标实现性能。
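
To make the definition concrete, here is a minimal sketch on a tiny deterministic graph, where "temporal distance" is taken to be the fewest number of steps between states. The graph `chain`, the helper names, and the breadth-first search are illustrative assumptions; the paper learns these representations rather than enumerating states.

```python
from collections import deque

def temporal_distances_to(goal, successors):
    """Fewest steps from every state to `goal`, via BFS on the reversed graph."""
    predecessors = {s: [] for s in successors}
    for s, nexts in successors.items():
        for n in nexts:
            predecessors[n].append(s)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        cur = queue.popleft()
        for p in predecessors[cur]:
            if p not in dist:
                dist[p] = dist[cur] + 1
                queue.append(p)
    return dist

def dual_representation(goal, successors):
    """Represent `goal` by its temporal distance from every state (sorted order)."""
    dist = temporal_distances_to(goal, successors)
    return [dist.get(s, float("inf")) for s in sorted(successors)]

def greedy_next_state(state, goal, successors):
    """Recover a goal-reaching move using only the goal's dual representation."""
    states = sorted(successors)
    rep = dual_representation(goal, successors)
    return min(successors[state], key=lambda n: rep[states.index(n)])

# Hypothetical 4-state corridor: 0 <-> 1 <-> 2 <-> 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rep_goal3 = dual_representation(3, chain)
```

In this toy case the representation depends only on the transition structure, so renaming or re-encoding the states would leave the distance vector unchanged, which mirrors the invariance property the abstract describes.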
Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management
通过基于端到端摘要的上下文管理扩展 LLM 多轮 RL
- Authors: Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen
- Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.06727
- Pdf link: https://arxiv.org/pdf/2510.06727
- Abstract
We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, SUPO can further improve the evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used in training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
- 中文摘要
我们研究大型语言模型 (LLM) 智能体在长程多轮工具使用场景下的强化学习 (RL) 微调,其中上下文长度很快成为根本瓶颈。现有的 RL 流水线可能会受到指令遵循能力下降、rollout 成本过高以及最重要的严格上下文限制的影响。为了应对这些挑战,我们在训练中引入了基于摘要的上下文管理。具体来说,它定期用 LLM 生成的摘要压缩工具使用历史,这些摘要保留与任务相关的信息,从而保持上下文紧凑,同时使智能体能够扩展到固定的上下文窗口之外。基于此形式化,我们推导出了一个策略梯度表示,使标准 LLM RL 基础设施能够以端到端的方式同时优化工具使用行为和摘要策略。我们使用 SUmmarization augmented Policy Optimization (SUPO) 实例化该框架,这是一种 LLM RL 算法,可实现超出固定上下文限制的长程训练。在交互式函数调用和搜索任务上的实验表明,与基线相比,SUPO 在保持相同甚至更短的工作上下文长度的同时显著提高了成功率。我们还证明,对于复杂的搜索任务,当测试时的最大摘要轮数扩展到超过训练时的设置时,SUPO 可以进一步提高评估性能。我们的结果将基于摘要的上下文管理确立为一种有原则且可扩展的方法,用于训练超出固定上下文长度限制的 RL 智能体。
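
The context-management loop described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_summarize` is a hypothetical stand-in for the LLM-written summary, the budget is counted in characters rather than tokens, and the end-to-end policy-gradient training of the summarizer is not shown.

```python
from typing import Callable, List

def run_with_summaries(
    steps: List[str],
    max_context_chars: int,
    summarize: Callable[[List[str]], str],
) -> List[List[str]]:
    """Append tool-use steps to a working context, compressing it when it overflows.

    Returns the working context as seen after each step.
    """
    context: List[str] = []
    seen: List[List[str]] = []
    for step in steps:
        context.append(step)
        if sum(len(c) for c in context) > max_context_chars:
            # Replace the whole history with a single summary entry.
            context = [summarize(context)]
        seen.append(list(context))
    return seen

def toy_summarize(history: List[str]) -> str:
    """Hypothetical summarizer: keep only entries tagged as task-relevant facts."""
    facts = [h[len("fact:"):] for h in history if h.startswith("fact:")]
    return "fact:" + "|".join(facts)

steps = ["tool:search q1", "fact:A=1", "tool:search q2", "fact:B=2", "tool:calc"]
contexts = run_with_summaries(steps, max_context_chars=30, summarize=toy_summarize)
```

Here the third step overflows the budget, so the raw history collapses to the retained fact and later steps continue from that compact context. A real summary could itself exceed the budget, which this sketch does not guard against.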
AWM: Accurate Weight-Matrix Fingerprint for Large Language Models
AWM:用于大型语言模型的准确权重矩阵指纹
- Authors: Boyi Zeng, Lin Chen, Ziwei He, Xinbing Wang, Zhouhan Lin
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.06738
- Pdf link: https://arxiv.org/pdf/2510.06738
- Abstract
Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo (such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling) pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at this https URL.
- 中文摘要
鉴于大型语言模型 (LLM) 的训练需要大量资源,保护其知识产权至关重要。因此,模型所有者和第三方都迫切需要确定可疑的 LLM 是从头开始训练的还是从现有的基础模型派生的。然而,模型通常经历的密集后训练过程(例如监督微调、大规模持续预训练、强化学习、多模态扩展、剪枝和升级改造 (upcycling))对可靠识别提出了重大挑战。在这项工作中,我们提出了一种基于权重矩阵的免训练指纹识别方法。我们利用线性分配问题 (LAP) 和无偏中心核对齐 (CKA) 相似度来抵消参数操作的影响,从而得到高度稳健且高保真的相似性指标。在包含 60 个正例和 90 个负例模型对的综合测试平台上,我们的方法对上述所有六类后训练过程都表现出卓越的鲁棒性,同时误报风险接近于零。通过在所有分类指标上取得满分,我们的方法为可靠的模型谱系验证奠定了坚实的基础。此外,整个计算在 NVIDIA 3090 GPU 上可在 30 秒内完成。该代码可在此 https URL 中找到。
Verifying Memoryless Sequential Decision-making of Large Language Models
验证大型语言模型的无记忆顺序决策
- Authors: Dennis Gross, Helge Spieker, Arnaud Gotlieb
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.06756
- Pdf link: https://arxiv.org/pdf/2510.06756
- Abstract
We introduce a tool for rigorous and automated verification of large language model (LLM)-based policies in memoryless sequential decision-making tasks. Given a Markov decision process (MDP) representing the sequential decision-making task, an LLM policy, and a safety requirement expressed as a PCTL formula, our approach incrementally constructs only the reachable portion of the MDP guided by the LLM's chosen actions. Each state is encoded as a natural language prompt, the LLM's response is parsed into an action, and the successor states reachable under the policy are expanded. The resulting formal model is checked with Storm to determine whether the policy satisfies the specified safety property. In experiments on standard grid world benchmarks, we show that open source LLMs accessed via Ollama can be verified when deterministically seeded, but generally underperform deep reinforcement learning baselines. Our tool natively integrates with Ollama and supports PRISM-specified tasks, enabling continuous benchmarking in user-specified sequential decision-making tasks and laying a practical foundation for formally verifying increasingly capable LLMs.
- 中文摘要
我们介绍了一种工具,用于在无记忆顺序决策任务中对基于大型语言模型 (LLM) 的策略进行严格且自动化的验证。给定代表顺序决策任务的马尔可夫决策过程 (MDP)、LLM 策略以及以 PCTL 公式表示的安全要求,我们的方法在 LLM 所选动作的引导下,仅增量构建 MDP 的可达部分。每个状态都编码为自然语言提示,LLM 的响应被解析为一个动作,并扩展该策略下可达的后继状态。使用 Storm 检查生成的形式模型,以确定策略是否满足指定的安全属性。在标准网格世界基准测试的实验中,我们表明,通过 Ollama 访问的开源 LLM 在使用确定性随机种子时可以得到验证,但其表现通常不如深度强化学习基线。我们的工具与 Ollama 原生集成,支持 PRISM 指定的任务,从而能够在用户指定的顺序决策任务中进行持续的基准测试,并为形式化验证能力越来越强的 LLM 奠定实用基础。
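
The construction step can be illustrated with a small sketch: starting from the initial state, only states reachable under the policy's chosen actions are expanded, and a reachability probability is then checked against a bound. Everything here is an assumption for illustration: the toy MDP, the fixed `policy` function standing in for prompting an LLM, and the simple fixed-point iteration used in place of the Storm model checker.

```python
from collections import deque

def transitions(state, action):
    """Toy MDP dynamics: returns a list of (probability, next_state) pairs."""
    if state in ("goal", "pit", "side"):
        return [(1.0, state)]                  # absorbing states
    if action == "detour":
        return [(1.0, "side")]                 # branch the policy below never takes
    next_state = "goal" if state == 2 else state + 1
    return [(0.9, next_state), (0.1, "pit")]   # "right" slips into the pit 10% of the time

def policy(state):
    """Stand-in for: prompt the LLM with the state, parse its reply into an action."""
    return "right"

def build_induced_chain(initial, policy, transitions):
    """Expand only the states reachable when following the policy."""
    chain = {}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state in chain:
            continue
        chain[state] = transitions(state, policy(state))
        for _, nxt in chain[state]:
            if nxt not in chain:
                queue.append(nxt)
    return chain

def reach_probability(chain, target, sweeps=100):
    """Probability of eventually reaching `target`, by repeated fixed-point sweeps."""
    prob = {s: 1.0 if s == target else 0.0 for s in chain}
    for _ in range(sweeps):
        for s in chain:
            if s != target:
                prob[s] = sum(p * prob[n] for p, n in chain[s])
    return prob

induced = build_induced_chain(0, policy, transitions)
p_unsafe = reach_probability(induced, "pit")[0]
is_safe = p_unsafe <= 0.3   # analogue of a PCTL bound such as P<=0.3 [ F "pit" ]
```

Because the policy never selects `detour`, the state `side` is never constructed, which is the saving the abstract attributes to building only the reachable portion.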
TTRV: Test-Time Reinforcement Learning for Vision Language Models
TTRV:视觉语言模型的测试时间强化学习
- Authors: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.06783
- Pdf link: https://arxiv.org/pdf/2510.06783
- Abstract
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
- 中文摘要
在强化学习中提取奖励信号的现有方法通常依赖于标记数据和专用的训练拆分,这种设置与人类直接从环境中学习的方式形成鲜明对比。在这项工作中,我们提出了 TTRV,通过在推理时动态调整模型来增强视觉语言理解,而无需任何标记数据。具体来说,我们对每个测试样本进行多次推理,并根据基础模型输出的频率设计奖励,以此增强组相对策略优化 (GRPO) 框架。此外,我们还提出通过同时奖励模型获得输出经验分布的低熵来控制模型输出的多样性。我们的方法在对象识别和视觉问答 (VQA) 方面都带来了一致的收益,最高分别提升 52.4% 和 29.8%,在 16 个数据集上平均提升 24.6% 和 10.0%。值得注意的是,在图像识别方面,应用于 InternVL 8B 的 TTRV 在 8 个基准测试中平均超过 GPT-4o 2.3%,同时在 VQA 上保持高度竞争力,证明测试时强化学习可以匹配或超过最强的专有模型。最后,我们发现VLM的测试时RL具有许多有趣的特性:例如,即使在数据极度受限的场景中,仅在单个随机选择的未标记测试示例上进行自适应,TTRV仍然能在识别任务中带来高达5.5%的可观改进。
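
A label-free reward of the kind described can be sketched as below. The exact reward formula is not given in the abstract, so the combination used here (answer frequency plus a negative-entropy bonus weighted by a hypothetical `entropy_weight`) is an assumption meant only to show the two ingredients.

```python
import math
from collections import Counter

def frequency_rewards(samples):
    """Reward each sampled answer by how often it occurs among the samples."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[s] / n for s in samples]

def empirical_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def label_free_rewards(samples, entropy_weight=0.5):
    """Frequency reward plus a shared bonus that is higher when entropy is lower."""
    bonus = -entropy_weight * empirical_entropy(samples)
    return [r + bonus for r in frequency_rewards(samples)]

# Four hypothetical answers sampled for one unlabeled test image.
answers = ["cat", "cat", "dog", "cat"]
rewards = label_free_rewards(answers)
```

The majority answer receives the larger reward without any ground-truth label, and a group whose answers all agree incurs no entropy penalty.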
$λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences
$λ$-GRPO:将 GRPO 框架与可学习的 token 偏好统一起来
- Authors: Yining Wang, Jinman Zhao, Chuangxin Zhao, Shuhao Guan, Gerald Penn, Shinan Liu
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.06870
- Pdf link: https://arxiv.org/pdf/2510.06870
- Abstract
Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We use $\lambda$-GRPO to denote our method, and we find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy by $+1.9\%$, $+1.0\%$, and $+1.7\%$ compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
- 中文摘要
基于人类反馈的强化学习 (RLHF) 一直是提高大型语言模型 (LLM) 推理能力的主要方法。最近,具有可验证奖励的强化学习(RLVR)通过用基于规则的验证器取代奖励模型和价值模型简化了这种范式。一个突出的例子是组相对策略优化 (GRPO)。然而,GRPO 本质上存在长度偏差,因为相同的优势值被统一分配给响应的所有 token。因此,较长的响应会将奖励分摊到更多 token 上,从而对梯度更新做出不成比例的贡献。DAPO 和 Dr. GRPO 等几种变体修改了损失的 token 级聚合方式,但这些方法仍然是启发式的,并且对其隐式 token 偏好的可解释性有限。在这项工作中,我们探索了允许模型在优化过程中学习自己的 token 偏好的可能性。我们将现有框架统一在单一公式下,并引入了一个可学习的参数 $\lambda$,该参数自适应地控制 token 级权重。我们用 $\lambda$-GRPO 来表示我们的方法,我们发现 $\lambda$-GRPO 在多个数学推理基准上比普通 GRPO 和 DAPO 取得了一致的改进。在具有 1.5B、3B 和 7B 参数的 Qwen2.5 模型上,与 GRPO 相比,$\lambda$-GRPO 的平均准确率分别提高了 $+1.9\%$、$+1.0\%$ 和 $+1.7\%$。重要的是,这些收益无需对训练数据进行任何修改,也不带来额外的计算成本,凸显了学习 token 偏好的有效性和实用性。
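
The token-level aggregation choices mentioned above can be compared numerically. The sketch below computes the loss weight each token receives under three schemes as they are commonly described outside this digest (an assumption, since the abstract does not spell them out): GRPO averages within each response and then across responses, DAPO averages over all tokens in the group, and Dr. GRPO divides by a fixed constant. The learnable $\lambda$ weighting itself is not reproduced here.

```python
def token_weights(lengths, mode, max_len=None):
    """Per-token loss weight for each response in a group, given response lengths."""
    group_size = len(lengths)
    if mode == "grpo":       # mean over tokens in a response, then mean over responses
        return [1.0 / (group_size * n) for n in lengths]
    if mode == "dapo":       # one mean over every token in the group
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if mode == "dr_grpo":    # token sum divided by a constant (assumed: group_size * max_len)
        return [1.0 / (group_size * max_len) for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")

def response_totals(lengths, weights):
    """Total weight each response contributes (per-token weight times its length)."""
    return [w * n for w, n in zip(weights, lengths)]

lengths = [2, 8]   # a short and a long response in the same group
grpo = token_weights(lengths, "grpo")
dapo = token_weights(lengths, "dapo")
grpo_totals = response_totals(lengths, grpo)
dapo_totals = response_totals(lengths, dapo)
```

With these lengths, GRPO gives each token of the long response a quarter of the weight of a short-response token, while DAPO weights tokens equally and lets the long response dominate the total. Seeing both extremes side by side is what motivates treating the weighting as something to tune or learn.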
SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
SaFeR-VLM:在多模态模型中走向安全感知的细粒度推理
- Authors: Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, Yang Liu
- Subjects:
Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.06871
- Pdf link: https://arxiv.org/pdf/2510.06871
- Abstract
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average scores of 70.13 and 78.97 on safety and helpfulness, respectively, across six benchmarks, surpassing both same-scale models and models more than 10× larger, such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our code is available at this https URL.
- 中文摘要
多模态大型推理模型(MLRM)展示了令人印象深刻的跨模态推理,但在对抗性或不安全的提示下往往会放大安全风险,我们将这种现象称为“推理税”(Reasoning Tax)。现有的防御主要作用于输出层面,不约束推理过程,使模型面临隐性风险。在本文中,我们提出了 SaFeR-VLM,这是一种安全对齐的强化学习框架,它将安全性直接嵌入到多模态推理中。该框架集成了四个组件:(I) QI-Safe-10K,一个精选数据集,强调安全关键和推理敏感案例;(II) 安全感知的 rollout,其中不安全的生成结果会经过反思和纠正,而不是被丢弃;(III) 结构化奖励建模,采用多维加权标准,并对幻觉和矛盾进行明确的惩罚;(IV) GRPO 优化,同时强化安全轨迹和纠正后的轨迹。这种统一的设计将安全性从被动保护转变为推理的主动驱动力,从而实现可扩展且可泛化的安全感知推理。SaFeR-VLM 进一步证明了对显性和隐性风险的鲁棒性,支持超越表面过滤的动态且可解释的安全决策。SaFeR-VLM-3B 在六个基准测试中的安全性和有用性平均得分分别为 70.13 和 78.97,超过了同规模模型以及规模大 10 倍以上的模型,例如 Skywork-R1V3-38B、Qwen2.5VL-72B 和 GLM4.5V-106B。值得注意的是,SaFeR-VLM-7B 受益于其规模的扩大,在安全指标上分别超过 GPT-5-mini 和 Gemini-2.5-Flash 6.47 分和 16.76 分,并且有用性没有任何下降。我们的代码可在此 https URL 中找到。
Multi-Dimensional Autoscaling of Stream Processing Services on Edge Devices
边缘设备上流处理服务的多维度自动伸缩
- Authors: Boris Sedlak, Philipp Raith, Andrea Morichetta, Víctor Casamayor Pujol, Schahram Dustdar
- Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
- Arxiv link: https://arxiv.org/abs/2510.06882
- Pdf link: https://arxiv.org/pdf/2510.06882
- Abstract
Edge devices have limited resources, which inevitably leads to situations where stream processing services cannot satisfy their needs. While existing autoscaling mechanisms focus entirely on resource scaling, Edge devices require alternative ways to sustain the Service Level Objectives (SLOs) of competing services. To address these issues, we introduce a Multi-dimensional Autoscaling Platform (MUDAP) that supports fine-grained vertical scaling across both service- and resource-level dimensions. MUDAP supports service-specific scaling tailored to available parameters, e.g., scale data quality or model size for a particular service. To optimize the execution across services, we present a scaling agent based on Regression Analysis of Structural Knowledge (RASK). The RASK agent efficiently explores the solution space and learns a continuous regression model of the processing environment for inferring optimal scaling actions. We compared our approach with two autoscalers, the Kubernetes VPA and a reinforcement learning agent, for scaling up to 9 services on a single Edge device. Our results showed that RASK can infer an accurate regression model in merely 20 iterations (i.e., after observing 200 s of processing). By increasingly adding elasticity dimensions, RASK sustained the highest request load with 28% fewer SLO violations than the baselines.
- 中文摘要
边缘设备的资源有限,这不可避免地导致流处理服务无法满足其需求的情况。虽然现有的自动伸缩机制完全专注于资源伸缩,但边缘设备需要其他方法来维持相互竞争的服务的服务级别目标 (SLO)。为了解决这些问题,我们引入了多维自动伸缩平台 (MUDAP),它支持跨服务级和资源级维度进行细粒度垂直伸缩。MUDAP 支持根据可用参数定制的服务特定伸缩,例如,调整特定服务的数据质量或模型大小。为了优化跨服务的执行,我们提出了一种基于结构知识回归分析 (RASK) 的伸缩智能体。RASK 智能体高效地探索解空间,并学习处理环境的连续回归模型,以推断最佳伸缩动作。我们将我们的方法与两个自动伸缩器(Kubernetes VPA 和一个强化学习智能体)进行了比较,在单个边缘设备上伸缩多达 9 个服务。我们的结果表明,RASK 只需 20 次迭代(即观察 200 秒的处理过程)就可以推断出准确的回归模型。通过逐步增加弹性维度,RASK 承受了最高的请求负载,与基线相比,SLO 违规次数减少了 28%。
Falsification-Driven Reinforcement Learning for Maritime Motion Planning
证伪驱动的海上运动规划强化学习
- Authors: Marlon Müller, Florian Finkeldei, Hanna Krasowski, Murat Arcak, Matthias Althoff
- Subjects:
Systems and Control (eess.SY); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.06970
- Pdf link: https://arxiv.org/pdf/2510.06970
- Abstract
Compliance with maritime traffic rules is essential for the safe operation of autonomous vessels, yet training reinforcement learning (RL) agents to adhere to them is challenging. The behavior of RL agents is shaped by the training scenarios they encounter, but creating scenarios that capture the complexity of maritime navigation is non-trivial, and real-world data alone is insufficient. To address this, we propose a falsification-driven RL approach that generates adversarial training scenarios in which the vessel under test violates maritime traffic rules, which are expressed as signal temporal logic specifications. Our experiments on open-sea navigation with two vessels demonstrate that the proposed approach provides more relevant training scenarios and achieves more consistent rule compliance.
- 中文摘要
遵守海上交通规则对于自主船舶的安全运行至关重要,但训练强化学习 (RL) 智能体遵守这些规则具有挑战性。RL 智能体的行为取决于它们遇到的训练场景,但创建能够捕捉海上航行复杂性的场景并非易事,仅靠真实世界的数据是不够的。为了解决这个问题,我们提出了一种证伪驱动的 RL 方法,该方法生成对抗性训练场景,使被测船舶在其中违反海上交通规则,这些规则以信号时序逻辑规范表示。我们在两艘船的公海航行上的实验表明,所提出的方法提供了更相关的训练场景,并实现了更一致的规则遵守。
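
Falsification against a signal temporal logic (STL) rule can be sketched with the standard quantitative semantics of an "always" formula, whose robustness is the minimum margin over time. The rule used here (keep at least `safe_distance` from the other vessel), the straight-line motion model, and the candidate start positions are illustrative assumptions, not the maritime traffic rules or the search procedure from the paper.

```python
import math

def straight_line(start, velocity, steps):
    """Positions of a vessel moving at constant velocity, one per time step."""
    return [(start[0] + velocity[0] * t, start[1] + velocity[1] * t) for t in range(steps)]

def always_safe_distance_robustness(traj_a, traj_b, safe_distance):
    """Robustness of G (dist(a, b) >= safe_distance): negative means the rule is violated."""
    return min(math.dist(p, q) - safe_distance for p, q in zip(traj_a, traj_b))

def falsify(own_traj, candidate_starts, velocity, safe_distance):
    """Return (robustness, start) for the candidate scenario with the lowest robustness."""
    steps = len(own_traj)
    scored = [
        (always_safe_distance_robustness(
            own_traj, straight_line(s, velocity, steps), safe_distance), s)
        for s in candidate_starts
    ]
    return min(scored)

own = straight_line((0, 0), (1, 0), 11)          # vessel under test heads east
candidates = [(5, 5), (5, 20), (9, 3)]           # hypothetical starts for a southbound vessel
worst_robustness, worst_start = falsify(own, candidates, (0, -1), safe_distance=2)
```

The scenario with negative robustness is the one a falsification-driven curriculum would feed back into training, since it is where the current behavior breaks the rule.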
No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
无需动作捕捉:仅使用文本提示、通过强化学习对运动扩散模型进行后训练
- Authors: Girolamo Macaluso, Lorenzo Mandelli, Mirko Bicchierai, Stefano Berretti, Andrew D. Bagdanov
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.06988
- Pdf link: https://arxiv.org/pdf/2510.06988
- Abstract
Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model's generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.
- 中文摘要
扩散模型最近推进了人体运动生成,可以根据文本提示生成逼真且多样化的动画。然而,使这些模型适应未见过的动作或风格通常需要额外的动作捕捉数据和完整的重新训练,成本高昂且难以扩展。我们提出了一个基于强化学习的后训练框架,该框架仅使用文本提示来微调预训练的运动扩散模型,而无需任何运动真值数据。我们的方法采用预训练的文本-运动检索网络作为奖励信号,并使用去噪扩散策略优化 (Denoising Diffusion Policy Optimization) 来优化扩散策略,有效地将模型的生成分布转移到目标域,而无需依赖配对运动数据。我们使用 HumanML3D 和 KIT-ML 数据集,在潜空间和关节空间两类扩散架构上,通过跨数据集适应和留一法运动实验评估了我们的方法。定量指标和用户研究的结果表明,我们的方法持续提高了生成运动的质量和多样性,同时保持了在原始分布上的性能。我们的方法是一种灵活、数据高效且保护隐私的运动自适应解决方案。
Diffusing Trajectory Optimization Problems for Recovery During Multi-Finger Manipulation
为多指操作中的恢复扩散生成轨迹优化问题
- Authors: Abhinav Kumar, Fan Yang, Sergio Aguilera Marinovic, Soshi Iba, Rana Soltani Zarrin, Dmitry Berenson
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.07030
- Pdf link: https://arxiv.org/pdf/2510.07030
- Abstract
Multi-fingered hands are emerging as powerful platforms for performing fine manipulation tasks, including tool use. However, environmental perturbations or execution errors can impede task performance, motivating the use of recovery behaviors that enable normal task execution to resume. In this work, we take advantage of recent advances in diffusion models to construct a framework that autonomously identifies when recovery is necessary and optimizes contact-rich trajectories to recover. We use a diffusion model trained on the task to estimate when states are not conducive to task execution, framed as an out-of-distribution detection problem. We then use diffusion sampling to project these states in-distribution and use trajectory optimization to plan contact-rich recovery trajectories. We also propose a novel diffusion-based approach that distills this process to efficiently diffuse the full parameterization, including constraints, goal state, and initialization, of the recovery trajectory optimization problem, saving time during online execution. We compare our method to a reinforcement learning baseline and other methods that do not explicitly plan contact interactions, including on a hardware screwdriver-turning task where we show that recovering using our method improves task performance by 96% and that ours is the only method evaluated that can attempt recovery without causing catastrophic task failure. Videos can be found at this https URL.
- 中文摘要
多指手正在成为执行精细操作任务(包括工具使用)的强大平台。然而,环境扰动或执行错误可能会阻碍任务执行,这促使人们使用恢复行为来使正常的任务执行得以继续。在这项工作中,我们利用扩散模型的最新进展来构建一个框架,该框架可以自主识别何时需要恢复,并优化接触丰富的轨迹以完成恢复。我们使用在该任务上训练的扩散模型来估计状态何时不利于任务执行,并将其表述为分布外检测问题。然后,我们使用扩散采样将这些状态投影回分布内,并使用轨迹优化来规划接触丰富的恢复轨迹。我们还提出了一种基于扩散的新方法,该方法对这一过程进行蒸馏,以高效地扩散生成恢复轨迹优化问题的完整参数化(包括约束、目标状态和初始化),从而节省在线执行期间的时间。我们将我们的方法与强化学习基线和其他没有显式规划接触交互的方法进行了比较,其中包括一个硬件上的螺丝刀转动任务;在该任务中,我们表明使用我们的方法进行恢复可将任务性能提高 96%,并且在所评估的方法中,只有我们的方法能够在尝试恢复时不导致灾难性的任务失败。视频可以在此 https URL 中找到。
Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
工具增强策略优化:推理和自适应工具使用与强化学习的协同作用
- Authors: Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.07038
- Pdf link: https://arxiv.org/pdf/2510.07038
- Abstract
Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operations. To overcome these limitations, we propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that systematically integrates multi-hop reasoning with adaptive tool-calling capabilities. Our approach employs a modified version of Dynamic Sampling Policy Optimization (DAPO), a recently developed RL paradigm, which we adapt specifically for tool invocation scenarios, enabling models to dynamically interleave complex reasoning with on-demand tool usage (including search APIs and Python interpreters). To support this research, we introduce two new datasets: TAPO-easy-60K and TAPO-hard-18K, specifically designed to train and evaluate both fact-based reasoning and mathematical calculation capabilities. Our experiments on Qwen2.5-3B and Qwen2.5-7B models demonstrate the effectiveness of our approach, with both models achieving state-of-the-art performance on tasks requiring external knowledge and mathematical computation among methods with comparable parameters. Notably, TAPO achieves more efficient tool utilization than baseline methods while preventing excessive calls caused by reward hacking. These results highlight the significant potential of combining advanced reasoning with tool usage to enhance model performance in knowledge-intensive and computationally demanding tasks.
- 中文摘要
大型语言模型 (LLM) 的最新进展普及了测试时扩展,即模型在产生最终答案之前生成额外的推理 token。这些方法在涉及数学推理的基准测试中表现出显著的性能改进。然而,仅依赖直接推理的语言模型仍然难以完成需要最新知识的任务,或需要借助计算器和代码解释器等计算工具进行复杂算术运算的任务。为了克服这些限制,我们提出了工具增强策略优化(TAPO),这是一种新颖的强化学习框架,它系统地将多跳推理与自适应工具调用功能集成在一起。我们的方法采用了动态采样策略优化 (DAPO) 的修改版本,这是一种最近开发的 RL 范式,我们专门针对工具调用场景进行了调整,使模型能够将复杂的推理与按需工具使用(包括搜索 API 和 Python 解释器)动态交错。为了支持这项研究,我们引入了两个新数据集:TAPO-easy-60K 和 TAPO-hard-18K,专门设计用于训练和评估基于事实的推理和数学计算能力。我们在Qwen2.5-3B和Qwen2.5-7B模型上的实验证明了我们方法的有效性,在参数规模相当的方法中,这两个模型在需要外部知识和数学计算的任务上都取得了最先进的性能。值得注意的是,TAPO 比基线方法实现了更高效的工具利用率,同时防止了奖励投机 (reward hacking) 引起的过度调用。这些结果凸显了将高级推理与工具使用相结合的巨大潜力,以增强模型在知识密集型和计算要求较高的任务中的性能。
Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models
Search-R3:在大型语言模型中统一推理和嵌入生成
- Authors: Yuntao Gui, James Cheng
- Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.07048
- Pdf link: https://arxiv.org/pdf/2510.07048
- Abstract
Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to generate search embeddings as a direct output of their reasoning process. Our approach exploits LLMs' chain-of-thought capabilities, allowing them to produce more effective embeddings by reasoning step-by-step through complex semantic analyses. We implement this through three complementary mechanisms: (1) a supervised learning stage that enables the model to produce quality embeddings, (2) a reinforcement learning (RL) methodology that optimizes embedding generation alongside reasoning, and (3) a specialized RL environment that efficiently handles evolving embedding representations without requiring complete corpus re-encoding at each training iteration. Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes. This integrated post-training approach represents a substantial advancement in handling complex knowledge-intensive tasks that require both sophisticated reasoning and effective information retrieval. Project page: this https URL
- 中文摘要
尽管大型语言模型 (LLM) 具有卓越的自然语言理解能力,但在检索任务中尚未得到充分利用。我们提出了 Search-R3,这是一个新颖的框架,它通过调整 LLM 来生成搜索嵌入作为其推理过程的直接输出来解决这一限制。我们的方法利用了 LLM 的思维链能力,使其能够通过复杂的语义分析逐步推理来产生更有效的嵌入。我们通过三个互补机制来实现这一点:(1) 监督学习阶段,使模型能够产生高质量的嵌入,(2) 强化学习 (RL) 方法,在推理的同时优化嵌入生成,以及 (3) 专门的 RL 环境,可以有效处理不断演化的嵌入表示,而无需在每次训练迭代时对整个语料库重新编码。我们在不同基准上的广泛评估表明,Search-R3 通过统一推理和嵌入生成过程,明显优于以前的方法。这种一体化的后训练方法代表了在处理复杂的知识密集型任务方面的重大进步,这些任务既需要复杂的推理,也需要有效的信息检索。项目页面:此 https URL
Sampling Strategies for Robust Universal Quadrupedal Locomotion Policies
稳健通用四足运动策略的采样策略
- Authors: David Rytz, Kim Tien Ly, Ioannis Havoutis
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.07094
- Pdf link: https://arxiv.org/pdf/2510.07094
- Abstract
This work focuses on sampling strategies of configuration variations for generating robust universal locomotion policies for quadrupedal robots. We investigate the effects of sampling physical robot parameters and joint proportional-derivative gains to enable training a single reinforcement learning policy that generalizes to multiple parameter configurations. Three fundamental joint gain sampling strategies are compared: parameter sampling with (1) linear and polynomial function mappings of mass-to-gains, (2) performance-based adaptive filtering, and (3) uniform random sampling. We improve the robustness of the policy by biasing the configurations using nominal priors and reference models. All training was conducted on RaiSim, tested in simulation on a range of diverse quadrupeds, and zero-shot deployed onto hardware using the ANYmal quadruped robot. Compared to multiple baseline implementations, our results demonstrate the need for significant joint controller gains randomization for robust closing of the sim-to-real gap.
- 中文摘要
这项工作的重点是配置变化的采样策略,用于为四足机器人生成鲁棒的通用运动策略。我们研究了对机器人物理参数和关节比例-微分 (PD) 增益进行采样的影响,以便能够训练出可泛化到多种参数配置的单一强化学习策略。我们比较了三种基本的关节增益采样策略:(1)采用质量到增益的线性和多项式函数映射的参数采样,(2)基于性能的自适应过滤,以及(3)均匀随机采样。我们通过使用标称先验和参考模型对配置进行偏置来提高策略的鲁棒性。所有训练均在 RaiSim 上进行,在一系列不同的四足机器人上进行了仿真测试,并以零样本方式部署到 ANYmal 四足机器人硬件上。与多个基线实现相比,我们的结果表明,要稳健地缩小仿真到现实的差距,需要对关节控制器增益进行大幅度的随机化。
The Contingencies of Physical Embodiment Allow for Open-Endedness and Care
物理具身的偶然性使开放性与关怀成为可能
- Authors: Leonardo Christov-Moore (1), Arthur Juliani (1), Alex Kiefer (1 and 2 and 3), Nicco Reggente (1), B. Scott Rousse (4), Adam Safron (1 and 5), Nicolás Hinrichs (6 and 7), Daniel Polani (8), Antonio Damasio (9) ((1) Institute for Advanced Consciousness Studies, Santa Monica, CA, (2) VERSES, (3) Monash Centre for Consciousness and Contemplative Studies, (4) Allen Discovery Center, (5) Allen Discovery Center, (6) Okinawa Institute of Science and Technology, (7) Max Planck Institute for Human Cognitive and Brain Sciences, (8) University of Hertfordshire, (9) Brain and Creativity Institute)
- Subjects:
Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.07117
- Pdf link: https://arxiv.org/pdf/2510.07117
- Abstract
Physical vulnerability and mortality are often seen as obstacles to be avoided in the development of artificial agents, which struggle to adapt to open-ended environments and provide aligned care. Meanwhile, biological organisms survive, thrive, and care for each other in an open-ended physical world with relative ease and efficiency. Understanding the role of the conditions of life in this disparity can aid in developing more robust, adaptive, and caring artificial agents. Here we define two minimal conditions for physical embodiment inspired by the existentialist phenomenology of Martin Heidegger: being-in-the-world (the agent is a part of the environment) and being-towards-death (unless counteracted, the agent drifts toward terminal states due to the second law of thermodynamics). We propose that from these conditions we can obtain both a homeostatic drive - aimed at maintaining integrity and avoiding death by expending energy to learn and act - and an intrinsic drive to continue to do so in as many ways as possible. Drawing inspiration from Friedrich Nietzsche's existentialist concept of will-to-power, we examine how intrinsic drives to maximize control over future states, e.g., empowerment, allow agents to increase the probability that they will be able to meet their future homeostatic needs, thereby enhancing their capacity to maintain physical integrity. We formalize these concepts within a reinforcement learning framework, which enables us to examine how intrinsically driven embodied agents learning in open-ended multi-agent environments may cultivate the capacities for open-endedness and care.
- 中文摘要
身体的脆弱性和必死性通常被视为人工智能体开发过程中需要避免的障碍,而人工智能体难以适应开放式环境并提供对齐的关怀。与此同时,生物有机体在开放的物理世界中相对轻松高效地生存、繁衍生息和相互照顾。了解生命条件在这种差异中的作用,有助于开发更稳健、适应性更强且更具关怀能力的人工智能体。在这里,我们定义了受马丁·海德格尔存在主义现象学启发的物理具身的两个最低条件:在世存在(智能体是环境的一部分)和向死而在(除非被抵消,否则智能体会由于热力学第二定律而漂移到终止状态)。我们提出,从这些条件出发,我们既可以得到一种稳态驱动力(旨在通过消耗能量去学习和行动来保持完整性并避免死亡),也可以得到一种以尽可能多的方式继续这样做的内在驱动力。从弗里德里希·尼采的存在主义权力意志概念中汲取灵感,我们研究了最大化对未来状态的控制的内在驱动力(例如赋能,empowerment)如何使智能体提高其能够满足未来稳态需求的概率,从而增强其保持身体完整性的能力。我们在强化学习框架中形式化了这些概念,这使我们能够研究在开放式多智能体环境中学习的、由内在驱动的具身智能体如何培养开放性与关怀的能力。
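
Empowerment, the intrinsic drive named in the abstract, has a simple closed form in deterministic environments: the n-step empowerment of a state is the base-2 logarithm of the number of distinct states reachable by n-step action sequences. The sketch below evaluates it in a hypothetical one-dimensional corridor with walls; the environment and function names are illustrative and are not taken from the paper's formalization.

```python
import math
from itertools import product

def step(pos, action, size):
    """Move left (-1), stay (0), or move right (+1); walls clamp the position."""
    return min(max(pos + action, 0), size - 1)

def empowerment(pos, horizon, size, actions=(-1, 0, 1)):
    """n-step empowerment in bits for a deterministic corridor of `size` cells."""
    finals = set()
    for seq in product(actions, repeat=horizon):
        p = pos
        for a in seq:
            p = step(p, a, size)
        finals.add(p)
    return math.log2(len(finals))

center = empowerment(3, horizon=2, size=7)   # five reachable cells: 1..5
edge = empowerment(0, horizon=2, size=7)     # three reachable cells: 0..2
```

An agent that prefers high-empowerment states drifts away from the wall toward the middle of the corridor, where more futures remain open. That is the sense in which maximizing control over future states can support later homeostatic needs.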
DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction
DPL:通过真实深度合成和交叉注意力地形重建的仅深度感知人形运动
- Authors: Jingkai Sun, Gang Han, Pihai Sun, Wen Zhao, Jiahang Cao, Jiaxu Wang, Yijie Guo, Qiang Zhang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.07152
- Pdf link: https://arxiv.org/pdf/2510.07152
- Abstract
Recent advancements in legged robot perceptive locomotion have shown promising progress. However, terrain-aware humanoid locomotion remains largely constrained to two paradigms: depth image-based end-to-end learning and elevation map-based methods. The former suffers from limited training efficiency and a significant sim-to-real gap in depth perception, while the latter depends heavily on multiple vision sensors and localization systems, resulting in latency and reduced robustness. To overcome these challenges, we propose a novel framework that tightly integrates three key components: (1) Terrain-Aware Locomotion Policy with a Blind Backbone, which leverages pre-trained elevation map-based perception to guide reinforcement learning with minimal visual input; (2) Multi-Modality Cross-Attention Transformer, which reconstructs structured terrain representations from noisy depth images; (3) Realistic Depth Images Synthetic Method, which employs self-occlusion-aware ray casting and noise-aware modeling to synthesize realistic depth observations, achieving over 30% reduction in terrain reconstruction error. This combination enables efficient policy training with limited data and hardware resources, while preserving critical terrain features essential for generalization. We validate our framework on a full-sized humanoid robot, demonstrating agile and adaptive locomotion across diverse and challenging terrains.
- 中文摘要
腿式机器人感知运动的最新进展显示出可喜的进展。然而,地形感知的人形运动在很大程度上仍然局限于两种范式:基于深度图像的端到端学习和基于高程图的方法。前者训练效率有限,深度感知存在明显的模拟与真实差距,而后者严重依赖多个视觉传感器和定位系统,导致延迟和鲁棒性降低。为了克服这些挑战,我们提出了一个新颖的框架,该框架紧密集成了三个关键组件:(1)具有盲骨干的地形感知运动策略,它利用预训练的基于高程图的感知,以最少的视觉输入指导强化学习;(2)多模态交叉注意力转换器,从嘈杂的深度图像中重建结构化地形表示;(3)真实深度图像合成方法,采用自遮挡感知光线投射和噪声感知建模,合成真实深度观测结果,实现地形重建误差降低30%以上。这种组合可以在有限的数据和硬件资源下实现高效的策略训练,同时保留对泛化至关重要的关键地形特征。我们在全尺寸人形机器人上验证了我们的框架,展示了在不同且具有挑战性的地形上的敏捷和自适应运动。
Reasoning for Hierarchical Text Classification: The Case of Patents
分层文本分类的推理:以专利为例
- Authors: Lekang Jiang, Wenjun Sun, Stephan Goetz
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.07167
- Pdf link: https://arxiv.org/pdf/2510.07167
- Abstract
Hierarchical text classification (HTC) assigns documents to multiple levels of a pre-defined taxonomy. Automated patent subject classification represents one of the hardest HTC scenarios because of domain knowledge difficulty and a huge number of labels. Prior approaches only output a flat label set, which offers little insight into the reason behind predictions. Therefore, we propose Reasoning for Hierarchical Classification (RHC), a novel framework that reformulates HTC as a step-by-step reasoning task to sequentially deduce hierarchical labels. RHC trains large language models (LLMs) in two stages: a cold-start stage that aligns outputs with chain-of-thought (CoT) reasoning format and a reinforcement learning (RL) stage to enhance multi-step reasoning ability. RHC demonstrates four advantages in our experiments. (1) Effectiveness: RHC surpasses previous baselines and outperforms the supervised fine-tuning counterparts by approximately 3% in accuracy and macro F1. (2) Explainability: RHC produces natural-language justifications before prediction to facilitate human inspection. (3) Scalability: RHC scales favorably with model size with larger gains compared to standard fine-tuning. (4) Applicability: Beyond patents, we further demonstrate that RHC achieves state-of-the-art performance on other widely used HTC benchmarks, which highlights its broad applicability.
- 中文摘要
分层文本分类 (HTC) 将文档分配给预定义分类的多个级别。自动化专利主题分类是最困难的 HTC 场景之一,因为领域知识难度和大量标签。先前的方法仅输出平面标签集,这几乎无法深入了解预测背后的原因。因此,我们提出了分层分类推理(RHC),这是一个新颖的框架,它将HTC重新表述为一个循序渐进的推理任务,以顺序推断分层标签。RHC 分两个阶段训练大型语言模型 (LLM):冷启动阶段,使输出与思维链 (CoT) 推理格式保持一致,以及强化学习 (RL) 阶段,以增强多步推理能力。RHC 在我们的实验中展示了四个优势。(1)有效性:RHC超过了之前的基线,在精度和宏观F1方面比监督微调同类产品高出约3%。(2)可解释性:RHC在预测之前产生自然语言理由,以方便人工检查。(3)可扩展性:与标准微调相比,RHC在模型大小下具有更大的增益。(4) 适用性:除了专利之外,我们还进一步证明 RHC 在其他广泛使用的 HTC 基准测试上取得了最先进的性能,这凸显了其广泛的适用性。
HyPlan: Hybrid Learning-Assisted Planning Under Uncertainty for Safe Autonomous Driving
HyPlan:不确定性下的混合学习辅助规划,实现安全自动驾驶
- Authors: Donald Pfaffmann, Matthias Klusch, Marcel Steinmetz
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.07210
- Pdf link: https://arxiv.org/pdf/2510.07210
- Abstract
We present a novel hybrid learning-assisted planning method, named HyPlan, for solving the collision-free navigation problem for self-driving cars in partially observable traffic environments. HyPlan combines methods for multi-agent behavior prediction, deep reinforcement learning with proximal policy optimization and approximated online POMDP planning with heuristic confidence-based vertical pruning to reduce its execution time without compromising safety of driving. Our experimental performance analysis on the CARLA-CTS2 benchmark of critical traffic scenarios with pedestrians revealed that HyPlan may navigate safer than selected relevant baselines and perform significantly faster than considered alternative online POMDP planners.
- 中文摘要
我们提出了一种名为HyPlan的新型混合学习辅助规划方法,用于解决自动驾驶汽车在部分可观测交通环境中的无碰撞导航问题。HyPlan 将多智能体行为预测、深度强化学习与近端策略优化以及近似在线 POMDP 规划与基于启发式置信度的垂直修剪相结合,以在不影响驾驶安全的情况下缩短其执行时间。我们对行人关键交通场景的 CARLA-CTS2 基准测试的实验性能分析表明,HyPlan 可能比选定的相关基线更安全地导航,并且比考虑的替代在线 POMDP 规划器执行得更快。
Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping
Customer-R1:通过基于RL的LLM代理在网上购物中对人类行为进行个性化模拟
- Authors: Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Dakuo Wang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.07230
- Pdf link: https://arxiv.org/pdf/2510.07230
- Abstract
Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.
- 中文摘要
使用大型语言模型(LLMs)模拟人类逐步行为已成为一个新兴的研究方向,可以在各个实际领域得到应用。虽然以前的方法,包括提示、监督微调 (SFT) 和强化学习 (RL),在逐步行为建模方面显示出前景,但它们主要学习群体级别的策略,而不以用户的角色为条件,从而产生通用而不是个性化的模拟。在这项工作中,我们提出了一个关键问题:LLM 智能体如何更好地模拟个性化用户行为?我们介绍了 Customer-R1,这是一种基于 RL 的方法,用于在线购物环境中的个性化、逐步用户行为模拟。我们的策略以明确的角色为条件,并通过动作正确性奖励信号优化下一步的理由和动作生成。OPeRA数据集上的实验表明,Customer-R1不仅在下一步动作预测任务中明显优于提示和基于SFT的基线,而且更好地匹配了用户的动作分布,表明个性化行为模拟的保真度更高。
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts
Red-Bandit:通过 Bandit 指导的 LoRA 专家对 LLM 红队进行测试时间调整
- Authors: Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.07239
- Pdf link: https://arxiv.org/pdf/2510.07239
- Abstract
Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
- 中文摘要
自动红队已成为一种在部署之前审计大型语言模型 (LLM) 的可扩展方法,但现有方法缺乏在推理时有效适应特定于模型的漏洞的机制。我们介绍了 Red-Bandit,这是一个红队框架,可以在线调整,以识别和利用不同攻击风格(例如操纵、俚语)下的模型故障模式。Red-Bandit 使用强化学习对一组参数高效的 LoRA 专家进行后训练,每个专家都专门针对特定的攻击风格,通过基于规则的安全模型奖励生成不安全提示。在推理时,多臂老虎机策略根据目标模型的响应安全性,在这些攻击风格专家中动态选择,平衡探索和利用。Red-Bandit 在充分探索 (ASR@10) 下在 AdvBench 上取得了最先进的结果,同时产生了更易于人类阅读的提示(较低的困惑度)。此外,Red-Bandit 的老虎机策略可作为诊断工具,通过指示哪些攻击风格最有效地引发不安全行为来发现特定于模型的漏洞。
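The selection step described above is a standard multi-armed bandit problem. The abstract does not say which bandit algorithm Red-Bandit uses, so the sketch below assumes UCB1 purely for illustration, with abstract "arms" and simulated Bernoulli rewards standing in for the experts and the safety-model feedback. The names `ucb1_select`, `run_bandit`, and `success_probs` are hypothetical.

```python
import math
import random


def ucb1_select(counts, values, t, c=2.0):
    """Pick an arm by UCB1: play each arm once, then maximize mean + bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # every arm is tried once before scores are compared
    scores = [
        values[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(success_probs, steps, seed=0):
    """Simulate `steps` rounds against arms with fixed success probabilities.

    `success_probs` is a stand-in for unknown per-arm reward rates (assumption).
    Returns the pull counts and the running mean reward of each arm.
    """
    rng = random.Random(seed)
    k = len(success_probs)
    counts = [0] * k
    values = [0.0] * k
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, values, t)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values
```

After enough rounds the pull counts concentrate on the arm with the highest reward rate, which is the sense in which such a policy can double as a diagnostic of which arm is most effective.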
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
混合强化:奖励稀疏时,最好是密集的
- Authors: Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.07242
- Pdf link: https://arxiv.org/pdf/2510.07242
- Abstract
Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
- 中文摘要
大型语言模型 (LLM) 推理的后训练越来越依赖于可验证的奖励:提供 0-1 正确性信号的确定性检查器。虽然可靠,但这种二元反馈很脆弱:许多任务存在部分正确或可替代的答案,而验证器对这些答案给分不足,由此产生的全有或全无监督限制了学习。奖励模型提供更丰富、更连续的反馈,可以作为验证器的补充监督信号。我们引入了 HERO(混合集成奖励优化),这是一种强化学习框架,它以结构化的方式将验证器信号与奖励模型分数集成在一起。HERO 采用分层归一化,将奖励模型分数限定在验证器定义的组内,在细化质量差异的同时保持正确性,并采用方差感知加权来强调密集信号最重要的具有挑战性的提示。在不同的数学推理基准中,HERO 始终优于仅 RM 和仅验证器的基线,在可验证和难以验证的任务上都有强劲的收益。我们的结果表明,混合奖励设计保留了验证器的稳定性,同时利用奖励模型的细微差别来推进推理。
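The abstract gives no formula for stratified normalization, so the following is only one way the idea could be sketched: reward-model scores are min-max normalized separately within the verifier-correct and verifier-incorrect groups, then mapped into a narrow band around the verifier label. The function name `stratified_rewards`, the band width `eps`, and the specific mapping are assumptions, not the paper's definition.

```python
def stratified_rewards(verifier_labels, rm_scores, eps=0.2):
    """Combine 0/1 verifier labels with dense reward-model scores.

    Scores are normalized within each verifier group and mapped to
    [label - eps, label + eps]. With eps < 0.5 every verifier-correct
    response outranks every verifier-incorrect one, while the reward
    model still orders responses inside each group. (Illustrative
    mapping only; not taken from the paper.)
    """
    if len(verifier_labels) != len(rm_scores):
        raise ValueError("labels and scores must have the same length")
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must be in [0, 0.5) to keep groups separated")

    rewards = [0.0] * len(rm_scores)
    for label in (0, 1):
        idx = [i for i, v in enumerate(verifier_labels) if v == label]
        if not idx:
            continue
        lo = min(rm_scores[i] for i in idx)
        hi = max(rm_scores[i] for i in idx)
        span = hi - lo
        for i in idx:
            # Position of this score inside its group, in [0, 1];
            # if all scores in the group tie, use the midpoint.
            unit = (rm_scores[i] - lo) / span if span > 0 else 0.5
            rewards[i] = float(label) + eps * (2.0 * unit - 1.0)
    return rewards
```

For example, with labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.1, -0.3]`, the two correct responses land in `[0.8, 1.2]` and the two incorrect ones in `[-0.2, 0.2]`, so a high reward-model score can never lift an incorrect answer above a correct one.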
Test-Time Graph Search for Goal-Conditioned Reinforcement Learning
目标条件强化学习的测试时间图搜索
- Authors: Evgenii Opryshko, Junwei Quan, Claas Voelcker, Yilun Du, Igor Gilitschenski
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.07257
- Pdf link: https://arxiv.org/pdf/2510.07257
- Abstract
Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning approach to solve the GCRL task. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference. On the OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.
- 中文摘要
离线目标条件强化学习 (GCRL) 训练在测试时达到用户指定目标的策略,提供一种简单、无监督、与领域无关的方法,从未标记、无奖励的数据集中提取不同的行为。尽管如此,由于时间信用分配和误差累积,GCRL 代理的长期决策仍然很困难,而离线设置放大了这些影响。为了缓解这个问题,我们引入了测试时图搜索 (TTGS),这是一种解决 GCRL 任务的轻量级规划方法。TTGS 接受任何状态空间距离或成本信号,在数据集状态上构建加权图,并执行快速搜索以组装冻结策略执行的一系列子目标。当基本学习器基于值时,距离直接从学习到的目标条件值函数中导出,因此不需要手工制作的指标。TTGS 不需要更改训练、无需额外监督、无需在线交互、无需特权信息,并且完全在推理中运行。在OGBench基准测试中,TTGS提高了多个基本学习器在具有挑战性的运动任务上的成功率,证明了离线GCRL的简单指标引导测试时间规划的好处。
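The pipeline described above (build a weighted graph over dataset states, then search it for a subgoal sequence) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the distance function is passed in as a plain callable (Euclidean distance in the usage example, standing in for a learned value-based distance), and the edge threshold `max_edge` and the use of Dijkstra's algorithm are assumptions.

```python
import heapq
import math


def build_graph(states, distance, max_edge):
    """Connect every ordered pair of states whose distance is at most max_edge."""
    graph = {i: [] for i in range(len(states))}
    for i, s in enumerate(states):
        for j, g in enumerate(states):
            if i != j:
                d = distance(s, g)
                if d <= max_edge:
                    graph[i].append((j, d))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra search; returns the node indices from start to goal, or None."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Usage: toy 2-D states; intermediate nodes on the path act as subgoals
# that a frozen goal-conditioned policy would be asked to reach in turn.
states = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 5)]
graph = build_graph(states, math.dist, max_edge=1.5)
subgoals = shortest_path(graph, start=0, goal=3)
```

Restricting edges to short distances reflects the intuition that a distance estimate is more trustworthy between nearby states, so the long-horizon task is decomposed into hops the policy can plausibly execute.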
Online Rubrics Elicitation from Pairwise Comparisons
从成对比较中提取在线评分标准
- Authors: MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Yunzhong He, Afra Feyza Akyürek
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.07284
- Pdf link: https://arxiv.org/pdf/2510.07284
- Abstract
Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
- 中文摘要
评分标准提供了一种灵活的方式,用于在开放式长篇回答上训练 LLM,在这类场景中可验证的奖励不适用,而人类偏好只能提供粗略的信号。先前的研究表明,使用基于评分标准的奖励进行强化学习,可以在 LLM 后训练中带来持续的收益。大多数现有方法依赖于在训练过程中保持静态的评分标准。然而,这种静态评分标准很容易受到奖励黑客(reward hacking)类行为的影响,并且无法捕获训练期间新出现的期望要求。我们引入了在线评分标准引出 (OnlineRubrics),这是一种通过对当前策略和参考策略的响应进行成对比较,以在线方式动态整理评估标准的方法。随着训练的进行,这种在线过程可以持续识别和减少错误。根据经验,在 AlpacaEval、GPQA、ArenaHard 以及专家问题和评分标准的验证集上,与仅使用静态评分标准进行训练相比,这种方法可产生高达 8% 的持续改进。我们对得出的标准进行定性分析,并确定透明度、实用性、组织性和推理等突出主题。
Evolutionary Profiles for Protein Fitness Prediction
蛋白质适应性预测的进化概况
- Authors: Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
- Arxiv link: https://arxiv.org/abs/2510.07286
- Pdf link: https://arxiv.org/pdf/2510.07286
- Abstract
Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The codes will be made publicly available at this https URL.
- 中文摘要
预测突变的适应度影响是蛋白质工程的核心,但受到相对于序列空间大小的有限检测的限制。使用掩码语言建模 (MLM) 训练的蛋白质语言模型 (pLM) 表现出强大的零样本适应度预测;我们通过将自然进化解释为隐式奖励最大化、将 MLM 解释为逆强化学习 (IRL) 来提供统一的观点,其中现有序列充当专家演示,pLM 对数几率充当适应度估计。基于这一观点,我们引入了 EvoIF,这是一个轻量级模型,它集成了两个互补的进化信号源:(i) 来自检索到的同源物的家族内谱和 (ii) 从逆折叠 logit 中提炼的跨家族结构进化约束。EvoIF 通过紧凑的过渡块将序列结构表示与这些谱融合在一起,从而产生用于对数几率评分的校准概率。在 ProteinGym(217 个突变测定;>2.5M 个突变体)上,EvoIF 及其支持 MSA 的变体实现了最先进或有竞争力的性能,同时仅使用 0.15% 的训练数据和比最近的大模型更少的参数。消融实验证实,家族内和跨家族的谱是互补的,提高了跨功能类型、MSA 深度、类群和突变深度的鲁棒性。这些代码将在此 https URL 上公开。
Keyword: diffusion policy
No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
无需动作捕捉:仅使用文本提示进行强化学习的训练后运动扩散模型
- Authors: Girolamo Macaluso, Lorenzo Mandelli, Mirko Bicchierai, Stefano Berretti, Andrew D. Bagdanov
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.06988
- Pdf link: https://arxiv.org/pdf/2510.06988
- Abstract
Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model's generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.
- 中文摘要
扩散模型最近推进了人类运动生成,根据文本提示生成逼真且多样化的动画。然而,使这些模型适应未见过的动作或风格通常需要额外的动作捕捉数据和全面的重新训练,这成本高昂且难以扩展。我们提出了一个基于强化学习的后训练框架,该框架仅使用文本提示来微调预训练的运动扩散模型,而无需任何运动真值。我们的方法采用预训练的文本运动检索网络作为奖励信号,并使用去噪扩散策略优化 (Denoising Diffusion Policy Optimization) 来优化扩散策略,有效地将模型的生成分布转移到目标域,而无需依赖配对运动数据。我们使用HumanML3D和KIT-ML数据集,在潜在空间和关节空间两种扩散架构上,通过跨数据集适应和留一运动实验评估了我们的方法。定量指标和用户研究的结果表明,我们的方法持续提高生成运动的质量和多样性,同时保持在原始分布上的性能。我们的方法是一种灵活、数据高效且保护隐私的运动适应解决方案。