Generated: 2026-03-09 16:50:48 (UTC+8); arXiv release time: 2026-03-09 20:00 EDT (2026-03-10 08:00 UTC+8)

34 related papers today

Keyword: reinforcement learning

Autocorrelation effects in a stochastic-process model for decision making via time series

基于时间序列决策的随机过程模型中的自相关效应

PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions

PRISM:通过人类指令个性化精炼模仿技能以实现操控

A Novel Hybrid Heuristic-Reinforcement Learning Optimization Approach for a Class of Railcar Shunting Problems

一种针对一类铁路车辆调车问题的新型混合启发式-强化学习优化方法

Thinking with Spatial Code for Physical-World Video Reasoning

借助空间代码思考,实现物理世界视频推理

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

当评分标准失效:在无参考的强化学习后训练中,以错误枚举作为奖励用于虚拟试穿

MIRACL: A Diverse Meta-Reinforcement Learning for Multi-Objective Multi-Echelon Combinatorial Supply Chain Optimisation

MIRACL:一种多目标多层级组合供应链优化的多样化元强化学习

Task-Level Decisions to Gait Level Control: A Hierarchical Policy Approach for Quadruped Navigation

从任务级决策到步态级控制:四足导航的层级策略方法

OpenHEART: Opening Heterogeneous Articulated Objects with a Legged Manipulator

OpenHEART:用腿式操作器打开异构关节物体

Expert Knowledge-driven Reinforcement Learning for Autonomous Racing via Trajectory Guidance and Dynamics Constraints

基于轨迹引导和动力学约束的专家知识驱动的自主赛车强化学习

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

ReflexiCoder:教大型语言模型自我反思生成代码并通过强化学习自我纠正

PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

PatchCue:利用基于Patch的视觉线索增强视觉语言模型推理

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

回答前的信心:高效LLM不确定性估计的范式转变

Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

通过LLM推理进行分子优化的参考引导策略优化

Swooper: Learning High-Speed Aerial Grasping With a Simple Gripper

Swooper:用简单抓钳学习高速空中抓取

How to Model Your Crazyflie Brushless

如何为你的 Crazyflie Brushless 建模

LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution

LucidNFT:生成现实世界超分辨率的LR锚定多奖励偏好优化

TADPO: Reinforcement Learning Goes Off-road

TADPO:强化学习越野

ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

ViewFusion:多视角推理的结构化空间思维链

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

通过理解学习生成:统一多模态模型的理解驱动内在奖励

Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

魔鬼藏在狭隘策略中:释放驾驶VLA模型的探索能力

ChatShopBuddy: Towards Reliable Conversational Shopping Agents via Reinforcement Learning

ChatShopBuddy:通过强化学习迈向可靠的对话式购物智能体

Partial Policy Gradients for RL in LLMs

大型语言模型中强化学习的部分策略梯度

Dual-Agent Multiple-Model Reinforcement Learning for Event-Triggered Human-Robot Co-Adaptation in Decoupled Task Spaces

双智能体多模型强化学习,用于解耦任务空间中的事件触发人机共适应

Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning

通过多尺度奖励学习优化医学影像的3D扩散模型

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

MAPO:面向长程多轮对话的混合优势策略优化

Synthetic Monitoring Environments for Reinforcement Learning

强化学习的合成监控环境

Artificial Intelligence for Climate Adaptation: Reinforcement Learning for Climate Change-Resilient Transport

面向气候适应的人工智能:用于气候变化韧性交通的强化学习

From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty

从熵到校准不确定性:训练语言模型以推理不确定性

OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis

OralGPT-Plus:通过强化学习学习使用视觉工具进行全景X射线分析

Efficient, Property-Aligned Fan-Out Retrieval via RL-Compiled Diffusion

通过强化学习编译扩散实现高效、属性对齐的扇出检索

A Reference Architecture of Reinforcement Learning Frameworks

强化学习框架的参考架构

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

EgoReasoner:通过任务自适应结构化思维学习以自我为中心的四维推理

Boosting deep Reinforcement Learning using pretraining with Logical Options

利用逻辑选项的预训练提升深度强化学习

Keyword: diffusion policy

CDF-Glove: A Cable-Driven Force Feedback Glove for Dexterous Teleoperation

CDF-Glove:一种用于灵巧远程操作的钢索驱动力反馈手套