Generated: 2026-01-23 16:33:32 (UTC+8); arXiv release time: 2026-01-23 20:00 EST (2026-01-24 09:00 UTC+8)

21 relevant papers today

Keyword: reinforcement learning

ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation

A Mobile Magnetic Manipulation Platform for Gastrointestinal Navigation with Deep Reinforcement Learning Control

When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards

AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning

Explainable Deepfake Detection with RL Enhanced Self-Blended Images

Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

Performance-guided Reinforced Active Learning for Object Detection

From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models

Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

PhysProver: Advancing Automatic Theorem Proving for Physics

Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning

Decoupling Return-to-Go for Efficient Decision Transformer

PUMA: Perception-driven Unified Foothold Prior for Mobility Augmented Quadruped Parkour

Keyframe-Based Feed-Forward Visual Odometry

Dynamic Tactile Sensing System and Soft Actor Critic Reinforcement Learning for Inclusion Characterization

SAMTok: Representing Any Mask with Two Words

Efficiently Learning Robust Torque-based Locomotion Through Reinforcement with Model-Based Supervision

Structured Hints for Sample-Efficient Lean Theorem Proving

Learning to Discover at Test Time

LLM-in-Sandbox Elicits General Agentic Intelligence

Keyword: diffusion policy

No results found.