Generated: 2025-11-07 16:30:08 (UTC+8); arXiv announcement time: 2025-11-07 20:00 EST (2025-11-08 09:00 UTC+8)

27 related papers today

Keyword: reinforcement learning

Scaling Agent Learning via Experience Synthesis

From Static to Dynamic: Enhancing Offline-to-Online Reinforcement Learning via Energy-Guided Diffusion Stratification

RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods

Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization

Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots

Necessary and Sufficient Conditions for the Optimization-Based Concurrent Execution of Learned Robotic Tasks

CBMC-V3: A CNS-inspired Control Framework Towards Manipulation Agility with SNN

RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning

Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning

PUL-SLAM: Path-Uncertainty Co-Optimization with Lightweight Stagnation Detection for Efficient Robotic Exploration

Black-Box Guardrail Reverse-engineering Attack

Opus: A Quantitative Framework for Workflow Evaluation

Shared Spatial Memory Through Predictive Coding

Can Context Bridge the Reality Gap? Sim-to-Real Transfer of Context-Aware Policies

SSPO: Subsentence-level Policy Optimization

RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

MacroNav: Multi-Task Context Representation Learning Enables Efficient Navigation in Unknown Environments

Temporal Action Selection for Action Chunking

The Peril of Preference: Why GRPO fails on Ordinal Rewards

Fitting Reinforcement Learning Model to Behavioral Data under Bandits

V-Thinker: Interactive Thinking with Images

End-to-End Reinforcement Learning of Koopman Models for eNMPC of an Air Separation Unit

Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

Forgetting is Everywhere

GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction

Keyword: diffusion policy

No results