Generated: 2025-10-20 16:30:25 (UTC+8); arXiv release time: 2025-10-20 20:00 EDT (2025-10-21 08:00 UTC+8)

45 related articles today

Keyword: reinforcement learning

ES-C51: Expected Sarsa Based C51 Distributional Reinforcement Learning Algorithm

Composition-Grounded Instruction Synthesis for Visual Reasoning

Internalizing World Models via Self-Play Finetuning for Agentic RL

Directional Reasoning Injection for Fine-Tuning MLLMs

Learn to Change the World: Multi-level Reinforcement Learning with Model-Changing Actions

DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

Procedural Game Level Design with Deep Reinforcement Learning

Navigating the consequences of mechanical ventilation in clinical intensive care settings through an evolutionary game-theoretic framework

Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization

RM-RL: Role-Model Reinforcement Learning for Precise Robot Manipulation

Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential

Dual-Weighted Reinforcement Learning for Generative Preference Modeling

AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Towards Flash Thinking via Decoupled Advantage Policy Optimization

Towards Automated Chicken Deboning via Learning-based Dynamically-Adaptive 6-DoF Multi-Material Cutting

Towards Robust Zero-Shot Reinforcement Learning

Advancing Routing-Awareness in Analog ICs Floorplanning

MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games

Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-end Autonomous Driving

Expediting Reinforcement Learning by Incorporating Knowledge About Temporal Causality in the Environment

OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling

Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

JudgeSQL: Reasoning over SQL Candidates with Weighted Consensus Tournament

HEADER: Hierarchical Robot Exploration via Attention-Based Deep Reinforcement Learning with Expert-Guided Reward

ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations

Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth

ProSh: Probabilistic Shielding for Model-free Reinforcement Learning

RLAF: Reinforcement Learning from Automaton Feedback

Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL

DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation

Cavity Duplexer Tuning with 1d Resnet-like Neural Networks

FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement

Learning Correlated Reward Models: Statistical Barriers and Opportunities

BLIP3o-NEXT: Next Frontier of Native Image Generation

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

Keyword: diffusion policy

VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-end Autonomous Driving

VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
