Arxiv Papers of Today

Generated: 2026-02-18 16:49:53 (UTC+8); arXiv release time: 2026-02-18 20:00 EST (2026-02-19 09:00 UTC+8)

14 relevant papers today

Keyword: reinforcement learning

CLOT: Closed-Loop Global Motion Tracking for Whole-Body Humanoid Teleoperation

Authors: Tengjie Zhu, Guanyu Cai, Yang Zhaohui, Guanzhu Ren, Haohui Xie, ZiRui Wang, Junsong Wu, Jingbo Wang, Xiaokang Yang, Yao Mu, Yichao Yan, Yichao Yan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.15060
Pdf link: https://arxiv.org/pdf/2602.15060
Abstract Long-horizon whole-body humanoid teleoperation remains challenging due to accumulated global pose drift, particularly on full-sized humanoids. Although recent learning-based tracking methods enable agile and coordinated motions, they typically operate in the robot's local frame and neglect global pose feedback, leading to drift and instability during extended execution. In this work, we present CLOT, a real-time whole-body humanoid teleoperation system that achieves closed-loop global motion tracking via high-frequency localization feedback. CLOT synchronizes operator and robot poses in a closed loop, enabling drift-free human-to-humanoid mimicry over long time horizons. However, directly imposing global tracking rewards in reinforcement learning often results in aggressive and brittle corrections. To address this, we propose a data-driven randomization strategy that decouples observation trajectories from reward evaluation, enabling smooth and stable global corrections. We further regularize the policy with an adversarial motion prior to suppress unnatural behaviors. To support CLOT, we collect 20 hours of carefully curated human motion data for training the humanoid teleoperation policy. We design a transformer-based policy and train it for over 1300 GPU hours. The policy is deployed on a full-sized humanoid with 31 DoF (excluding hands). Both simulation and real-world experiments verify high-dynamic motion, high-precision tracking, and strong robustness in sim-to-real humanoid teleoperation. Motion data, demos, and code can be found on our website.

Near-Optimal Sample Complexity for Online Constrained MDPs

Authors: Chang Liu, Yunfan Li, Lin F. Yang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.15076
Pdf link: https://arxiv.org/pdf/2602.15076
Abstract Safety is a fundamental challenge in reinforcement learning (RL), particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to enforce safety constraints while optimizing performance. However, existing methods often suffer from significant safety violations or require a high sample complexity to generate near-optimal policies. We address two settings: relaxed feasibility, where small violations are allowed, and strict feasibility, where no violation is allowed. We propose a model-based primal-dual algorithm that balances regret and bounded constraint violations, drawing on techniques from online RL and constrained optimization. For relaxed feasibility, we prove that our algorithm returns an $\varepsilon$-optimal policy with $\varepsilon$-bounded violation with arbitrarily high probability, requiring $\tilde{O}\left(\frac{SAH^3}{\varepsilon^2}\right)$ learning episodes, matching the lower bound for unconstrained MDPs. For strict feasibility, we prove that our algorithm returns an $\varepsilon$-optimal policy with zero violation with arbitrarily high probability, requiring $\tilde{O}\left(\frac{SAH^5}{\varepsilon^2\zeta^2}\right)$ learning episodes, where $\zeta$ is the problem-dependent Slater constant characterizing the size of the feasible region. This result matches the lower bound for learning CMDPs with access to a generative model. Our results demonstrate that learning CMDPs in an online setting is as easy as learning with a generative model and is no more challenging than learning unconstrained MDPs when small violations are allowed.
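
To make the two sample-complexity bounds in this abstract concrete, the sketch below evaluates only their leading-order terms, $SAH^3/\varepsilon^2$ and $SAH^5/(\varepsilon^2\zeta^2)$. Constants and logarithmic factors hidden by $\tilde{O}$ are dropped, and the function names and numeric values are illustrative assumptions, not taken from the paper.

```python
def episodes_relaxed(S, A, H, eps):
    """Leading-order term S*A*H^3 / eps^2 (constants and log factors dropped)."""
    return S * A * H**3 / eps**2


def episodes_strict(S, A, H, eps, zeta):
    """Leading-order term S*A*H^5 / (eps^2 * zeta^2), zeta being the Slater constant."""
    return S * A * H**5 / (eps**2 * zeta**2)


# Hypothetical problem size, chosen only for illustration.
S, A, H, eps, zeta = 10, 5, 20, 0.1, 0.5

relaxed = episodes_relaxed(S, A, H, eps)
strict = episodes_strict(S, A, H, eps, zeta)

# The gap between the two settings is the factor H^2 / zeta^2.
ratio = strict / relaxed
print(f"relaxed ~ {relaxed:.3g}, strict ~ {strict:.3g}, ratio ~ {ratio:.3g}")
```

Under these assumed values the strict-feasibility term is larger by a factor of $H^2/\zeta^2$, so a smaller Slater constant (a tighter feasible region) raises the episode count quadratically.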

MyoInteract: A Framework for Fast Prototyping of Biomechanical HCI Tasks using Reinforcement Learning

Authors: Ankit Bhattarai, Hannah Selder, Florian Fischer, Arthur Fleig, Per Ola Kristensson
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.15245
Pdf link: https://arxiv.org/pdf/2602.15245
Abstract Reinforcement learning (RL)-based biomechanical simulations have the potential to revolutionise HCI research and interaction design, but currently lack usability and interpretability. Using the Human Action Cycle as a design lens, we identify key limitations of biomechanical RL frameworks and develop MyoInteract, a novel framework for fast prototyping of biomechanical HCI tasks. MyoInteract allows designers to set up tasks, user models, and training parameters from an easy-to-use GUI within minutes. It trains and evaluates muscle-actuated simulated users within minutes, reducing training times by up to 98%. A workshop study with 12 interaction designers revealed that MyoInteract allowed novices in biomechanical RL to successfully set up, train, and assess goal-directed user movements within a single session. By transforming biomechanical RL from a days-long expert task into an accessible hour-long workflow, this work significantly lowers barriers to entry and accelerates iteration cycles in HCI biomechanics research.

EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use

Authors: Siwei Wen, Zhangcheng Wang, Xingjian Zhang, Lei Huang, Wenjun Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.15329
Pdf link: https://arxiv.org/pdf/2602.15329
Abstract Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing, which often faces a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and utilizes event-granular reservoir sampling to process streaming video frames within a fixed-length buffer dynamically; long-term memory archives past observations in a structured, event-by-event manner. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to end-to-end internalize reasoning and tool-use strategies into the agent's intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: this https URL.
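
The abstract mentions reservoir sampling over a fixed-length buffer. The sketch below shows only the classic frame-level version (often called Algorithm R), which keeps a uniform random sample of `k` items from a stream of unknown length. The paper's event-granular variant is not described in the abstract, so this is an assumption-laden illustration of the underlying idea rather than the authors' method; `reservoir_sample` and the frame indices are hypothetical names.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from an iterable stream."""
    if rng is None:
        rng = random.Random()
    buffer = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the fixed-length buffer first.
            buffer.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                buffer[j] = item
    return buffer


# Stand-in for a video stream: 1000 frame indices, buffer of 8 frames.
kept = reservoir_sample(range(1000), k=8, rng=random.Random(0))
print(sorted(kept))
```

The buffer size stays constant no matter how long the stream runs, which is the property that makes the technique attractive when the context window is bounded.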

CDRL: A Reinforcement Learning Framework Inspired by Cerebellar Circuits and Dendritic Computational Strategies

Authors: Sibo Zhang, Rui Jing, Liangfu Lv, Jian Zhang, Yunliang Zang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2602.15367
Pdf link: https://arxiv.org/pdf/2602.15367
Abstract Reinforcement learning (RL) has achieved notable performance in high-dimensional sequential decision-making tasks, yet remains limited by low sample efficiency, sensitivity to noise, and weak generalization under partial observability. Most existing approaches address these issues primarily through optimization strategies, while the role of architectural priors in shaping representation learning and decision dynamics is less explored. Inspired by structural principles of the cerebellum, we propose a biologically grounded RL architecture that incorporates large expansion, sparse connectivity, sparse activation, and dendritic-level modulation. Experiments on noisy, high-dimensional RL benchmarks show that both the cerebellar architecture and dendritic modulation consistently improve sample efficiency, robustness, and generalization compared to conventional designs. Sensitivity analysis of architectural parameters suggests that cerebellum-inspired structures can offer optimized performance for RL with constrained model parameters. Overall, our work underscores the value of cerebellar structural priors as effective inductive biases for RL.

Fairness over Equality: Correcting Social Incentives in Asymmetric Sequential Social Dilemmas

Authors: Alper Demir, Hüseyin Aydın, Kale-ab Abebe Tessera, David Abel, Stefano V. Albrecht
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.15407
Pdf link: https://arxiv.org/pdf/2602.15407
Abstract Sequential Social Dilemmas (SSDs) provide a key framework for studying how cooperation emerges when individual incentives conflict with collective welfare. In Multi-Agent Reinforcement Learning, these problems are often addressed by incorporating intrinsic drives that encourage prosocial or fair behavior. However, most existing methods assume that agents face identical incentives in the dilemma and require continuous access to global information about other agents to assess fairness. In this work, we introduce asymmetric variants of well-known SSD environments and examine how natural differences between agents influence cooperation dynamics. Our findings reveal that existing fairness-based methods struggle to adapt under asymmetric conditions by enforcing raw equality that wrongfully incentivizes defection. To address this, we propose three modifications: (i) redefining fairness by accounting for agents' reward ranges, (ii) introducing an agent-based weighting mechanism to better handle inherent asymmetries, and (iii) localizing social feedback to make the methods effective under partial observability without requiring global information sharing. Experimental results show that in asymmetric scenarios, our method fosters faster emergence of cooperative policies compared to existing approaches, without sacrificing scalability or practicality.
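
One way to read modification (i), accounting for agents' reward ranges, is to compare rewards after rescaling each one by its agent's attainable range. The abstract does not give the exact formula, so the min-max normalization and the pairwise gap measure below are assumptions made purely for illustration; all names and numbers are hypothetical.

```python
def normalize_by_range(reward, low, high):
    """Rescale a reward to [0, 1] using the agent's own reward range."""
    if high == low:
        return 0.0
    return (reward - low) / (high - low)


def mean_abs_gap(values):
    """Average absolute difference over all pairs of agents."""
    n = len(values)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(abs(values[i] - values[j]) for i, j in pairs) / len(pairs)


# Two asymmetric agents: one can earn up to 10, the other only up to 2.
rewards = [5.0, 1.0]
ranges = [(0.0, 10.0), (0.0, 2.0)]

raw_gap = mean_abs_gap(rewards)
normalized = [normalize_by_range(r, lo, hi) for r, (lo, hi) in zip(rewards, ranges)]
fair_gap = mean_abs_gap(normalized)
print(raw_gap, fair_gap)
```

In this toy case both agents earn half of what they could, so a range-aware measure sees no inequity even though the raw rewards differ, which is the kind of situation where enforcing raw equality would send the wrong signal.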

Efficient Knowledge Transfer for Jump-Starting Control Policy Learning of Multirotors through Physics-Aware Neural Architectures

Authors: Welf Rehberg, Mihir Kulkarni, Philipp Weiss, Kostas Alexis
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.15533
Pdf link: https://arxiv.org/pdf/2602.15533
Abstract Efficiently training control policies for robots is a major challenge that can greatly benefit from utilizing knowledge gained from training similar systems through cross-embodiment knowledge transfer. In this work, we focus on accelerating policy training using a library-based initialization scheme that enables effective knowledge transfer across multirotor configurations. By leveraging a physics-aware neural control architecture that combines a reinforcement learning-based controller and a supervised control allocation network, we enable the reuse of previously trained policies. To this end, we utilize a policy evaluation-based similarity measure that identifies suitable policies for initialization from a library. We demonstrate that this measure correlates with the reduction in environment interactions needed to reach target performance and is therefore suited for initialization. Extensive simulation and real-world experiments confirm that our control architecture achieves state-of-the-art control performance, and that our initialization scheme saves on average up to $73.5\%$ of environment interactions (compared to training a policy from scratch) across diverse quadrotor and hexarotor designs, paving the way for efficient cross-embodiment transfer in reinforcement learning.

Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL

Authors: Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.15564
Pdf link: https://arxiv.org/pdf/2602.15564
Abstract Text-to-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries. The code is available at this https URL.

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Authors: Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.15620
Pdf link: https://arxiv.org/pdf/2602.15620
Abstract Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term "spurious tokens". When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.
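
The phrase "selectively masks such updates and renormalizes the loss over valid tokens" can be illustrated with a small sketch. How STAPO decides which tokens are spurious is not stated in the abstract, so the mask here is simply given as input; `masked_mean_loss` and the numbers are hypothetical and only show the mask-then-renormalize step.

```python
def masked_mean_loss(token_losses, spurious_mask):
    """Average per-token losses over tokens NOT flagged as spurious."""
    if len(token_losses) != len(spurious_mask):
        raise ValueError("token_losses and spurious_mask must have equal length")
    valid = [loss for loss, flagged in zip(token_losses, spurious_mask) if not flagged]
    if not valid:
        # Every token was masked; contribute no loss rather than divide by zero.
        return 0.0
    # Renormalize by the number of valid tokens, not the sequence length.
    return sum(valid) / len(valid)


# Toy sequence: the third token carries an abnormally large loss.
losses = [0.5, 0.25, 8.0, 0.75]
mask = [False, False, True, False]  # assumed to come from some detection rule

plain = sum(losses) / len(losses)
masked = masked_mean_loss(losses, mask)
print(plain, masked)
```

Dividing by the count of valid tokens instead of the full sequence length keeps the loss scale comparable between sequences with and without masked tokens.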

Recursive Concept Evolution for Compositional Reasoning in Large Language Models

Authors: Sarim Chaudhry
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.15725
Pdf link: https://arxiv.org/pdf/2602.15725
Abstract Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model's latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.

MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction

Authors: Qiang Zhang, Jiahao Ma, Peiran Liu, Shuai Shi, Zeran Su, Zifan Wang, Jingkai Sun, Wei Cui, Jialin Yu, Gang Han, Wen Zhao, Pihai Sun, Kangning Yin, Jiaxu Wang, Jiahang Cao, Lingfeng Zhang, Hao Cheng, Xiaoshuai Hao, Yiding Ji, Junwei Liang, Jian Tang, Renjing Xu, Yijie Guo
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.15733
Pdf link: https://arxiv.org/pdf/2602.15733
Abstract Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.

GLM-5: from Vibe Coding to Agentic Engineering

Authors: GLM-5 Team: Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong, Mingdao Liu, Mingming Zhao, Pengfan Du, Qian Dong, Rui Lu, Shuang-Li, Shulin Cao, Song Liu, Ting Jiang, Xiaodong Chen, Xiaohan Zhang, Xuancheng Huang, Xuezhen Dong, Yabo Xu, Yao Wei, Yifan An, Yilin Niu, Yitong Zhu, Yuanhao Wen, Yukuo Cen, Yushi Bai, Zhongpei Qiao, Zihan Wang, Zikang Wang, Zilin Zhu, Ziqiang Liu, Zixuan Li, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chen Li, Chen Li, Chenghua Huang, Chengwei Hu, Chenhui Zhang, Chenzheng Zhu, Congfeng Yin, Daoyan Lin, Dayong Yang, Di Wang, Ding Ai, Erle Zhu, Fangzhou Yi, Feiyu Chen, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Hao Peng, Hao Tai, Haobo Zhang, He Liu, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huan Liu, Huanpeng Chu, Jia'ni Zhao, Jiachen Wang, Jiajing Zhao, Jiamin Ren, Jiapeng Wang, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jijie Li, Jing An, Jing Li
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.15763
Pdf link: https://arxiv.org/pdf/2602.15763
Abstract We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at this https URL.

Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning

Authors: Oswin So, Eric Yang Yu, Songyuan Zhang, Matthew Cleaveland, Mitchell Black, Chuchu Fan
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2602.15817
Pdf link: https://arxiv.org/pdf/2602.15817
Abstract Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists, and learns a policy to solve the reachability problem over this set of initial conditions. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method for challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Authors: Zhen Wu, Xiaoyu Huang, Lujie Yang, Yuanhang Zhang, Koushil Sreenath, Xi Chen, Pieter Abbeel, Rocky Duan, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, C. Karen Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.15827
Pdf link: https://arxiv.org/pdf/2602.15827
Abstract While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot decides whether to step over, climb onto, vault, or roll off obstacles of varying geometries and heights, and executes the chosen skill. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25 m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.
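
The abstract describes motion matching as nearest-neighbor search in a feature space. A minimal sketch of that search step follows, assuming each candidate clip is summarized by a small feature vector and that squared Euclidean distance is the metric; the feature contents, the metric, and the name `nearest_clip` are illustrative assumptions, not details from the paper.

```python
def nearest_clip(query, database):
    """Return the index of the database feature vector closest to the query."""
    if not database:
        raise ValueError("database must contain at least one feature vector")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(database)), key=lambda i: squared_distance(query, database[i]))


# Hypothetical 3-dimensional features for three motion clips.
database = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.2],
]
query = [0.9, 0.1, 0.4]

best = nearest_clip(query, database)
print(best)
```

A linear scan like this is enough for a small skill library; larger databases would typically swap in a spatial index, which is outside what the abstract specifies.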

Keyword: diffusion policy

No results