Generated: 2026-05-06 18:21:37 (UTC+8); arXiv release time: 2026-05-06 20:00 EDT (2026-05-07 08:00 UTC+8)

27 relevant papers today

Keyword: reinforcement learning

An End-to-End Framework for Building Large Language Models for Software Operations

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

Healthcare AI GYM for Medical Agents

Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

Joint Energy Management and Coordinated AIGC Workload Scheduling for Distributed Data Centers: A Diffusion-Aided Reward Shaping Approach

Taming the Curses of Multiagency in Robust Markov Games with Large State Space through Linear Function Approximation

MARS-DA: A Hierarchical Reinforcement Learning Framework for Risk-Aware Multi-Agent Bidding in Power Grids

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

QoS Assurance Mechanism for 5G Network Slicing Based on the Deep Reinforcement Learning PPO Algorithm

Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

Discovering Reinforcement Learning Interfaces with Large Language Models

Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits

FINER-SQL: Boosting Small Language Models for Text-to-SQL

MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

Real-Time Evaluation of Autonomous Systems under Adversarial Attacks

Vanishing L2 regularization for the softmax Multi Armed Bandit

What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Keyword: diffusion policy

No results found