Generated: 2026-03-11 16:45:52 (UTC+8); arXiv release time: 2026-03-11 20:00 EDT (2026-03-12 08:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model

APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model

Optimizing Reinforcement Learning Training over Digital Twin Enabled Multi-fidelity Networks

Interpretable Markov-Based Spatiotemporal Risk Surfaces for Missing-Child Search Planning with Reinforcement Learning and LLM-Based Quality Assurance

FAME: Force-Adaptive RL for Expanding the Manipulation Envelope of a Full-Scale Humanoid

MAPLE: Elevating Medical Reasoning from Statistical Consensus to Process-Led Alignment

PlayWorld: Learning Robot World Models from Autonomous Play

Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection

Learning Adaptive LLM Decoding

Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation

Embodied Human Simulation for Quantitative Design and Analysis of Interactive Robotics

Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control

MO-Playground: Massively Parallelized Multi-Objective Reinforcement Learning for Robotics

Social-R1: Towards Human-like Social Reasoning in LLMs

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning

Robust Regularized Policy Iteration under Transition Uncertainty

SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space

Impact of Markov Decision Process Design on Sim-to-Real Reinforcement Learning

SEA-Nav: Efficient Policy Learning for Safe and Agile Quadruped Navigation in Cluttered Environments

MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

RecThinker: An Agentic Framework for Tool-Augmented Reasoning in Recommendation

Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts

When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic

Keyword: diffusion policy

No results.