Generated: 2026-02-23 16:53:28 (UTC+8); arXiv release time: 2026-02-23 20:00 EST (2026-02-24 09:00 UTC+8)

22 relevant papers today

Keyword: reinforcement learning

Epistemic Traps: Rational Misalignment Driven by Model Misspecification

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

Optimal Multi-Debris Mission Planning in LEO: A Deep Reinforcement Learning Approach with Co-Elliptic Transfers and Refueling

Reinforcement-Learning-Based Assistance Reduces Squat Effort with a Modular Hip–Knee Exoskeleton

MePoly: Max Entropy Polynomial Policy Optimization

MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance

Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning

Graph-Neural Multi-Agent Coordination for Distributed Access-Point Selection in Cell-Free Massive MIMO

Learning Optimal and Sample-Efficient Decision Policies with Guarantees

Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly

Flow Actor-Critic for Offline Reinforcement Learning

Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets

Mean-Field Reinforcement Learning without Synchrony

Decision Support under Prediction-Induced Censoring

Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Interacting safely with cyclists using Hamilton-Jacobi reachability and reinforcement learning

TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatchs

Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

PRISM: Parallel Reward Integration with Symmetry for MORL

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

Keyword: diffusion policy

No results.