Generated: 2026-04-10 17:14:02 (UTC+8); arXiv release time: 2026-04-10 20:00 EDT (2026-04-11 08:00 UTC+8)

47 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning with Reward Machines for Sleep Control in Mobile Networks

SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval

Dual-Rerank: Fusing Causality and Utility for Industrial Generative Reranking

GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control

Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm

Active Reward Machine Inference From Raw State Trajectories

CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection

ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework

RL-ASL: A Dynamic Listening Optimization for TSCH Networks Using Reinforcement Learning

Dual-Loop Control in DCVerse: Advancing Reliable Deployment of AI in Data Centers via Digital Twins

PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

An Imperfect Verifier is Good Enough: Learning with Noisy Rewards

Reset-Free Reinforcement Learning for Real-World Agile Driving: An Empirical Study

Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

Automotive Engineering-Centric Agentic AI Workflow Framework

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch

ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

Learning over Forward-Invariant Policy Classes: Reinforcement Learning without Safety Concerns

AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

On-Policy Distillation of Language Models for Autonomous Vehicle Motion Planning

Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation

TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

A Decomposition Perspective to Long-context Reasoning for LLMs

PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC

Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection

ASPECT: Analogical Semantic Policy Execution via Language Conditioned Transfer

Synthetic Data for any Differentiable Target

NL-CPS: Reinforcement Learning-Based Kubernetes Control Plane Placement in Multi-Region Clusters

Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks

TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Keyword: diffusion policy

There are no results.