Generated: 2026-06-10 19:40:58 (UTC+8); arXiv release time: 2026-06-10 20:00 EDT (2026-06-11 08:00 UTC+8)

41 relevant papers today

Keyword: reinforcement learning

Self-EmoQ: Plutchik-Guided Value-based Planning to Drive Streaming Emotional TTS

TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix

SocraticPO: Policy Optimization via Interactive Guidance

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

Locomotion analysis of a quadruped interacting with the lunar granular surface

MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

Belief-Space Control for Personalized Cancer Treatment via Active Inference

Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation

Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

Vector Map as Language: Toward Unified Remote Sensing Vector Mapping

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

MODIP: Efficient Model-Based Optimization for Diffusion Policies

GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

LLM-Mediated Demand Response Coordination in Smart Microgrids

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Keyword: diffusion policy

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

MODIP: Efficient Model-Based Optimization for Diffusion Policies
