Generated: 2026-06-12 19:44:05 (UTC+8); arXiv release time: 2026-06-12 20:00 EDT (2026-06-13 08:00 UTC+8)

30 related papers today

Keyword: reinforcement learning

ReCal: Reward Calibration for RL-based LLM Routing

Boosting Direct Preference Optimization with Penalization

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Redesigning Regularization for Effective Policy Smoothing

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

Understanding helpfulness and harmless tension in reward models

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Reinforcement Learning for Neural Model Editing

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

Improving Robotic Generalist Policies via Flow Reversal Steering

Mana: Dexterous Manipulation of Articulated Tools

Keyword: diffusion policy

Action-Effect Memory Pretraining for Robot Manipulation
