Generated: 2026-05-14 18:21:03 (UTC+8); arXiv announcement time: 2026-05-14 20:00 EDT (2026-05-15 08:00 UTC+8)

45 relevant papers today

Keyword: reinforcement learning

SP-GCRL: Influence Maximization on Incomplete Social Graphs

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games

Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

Driving Intents Amplify Planning-Oriented Reinforcement Learning

Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

3D RL-DWA: A Hybrid Reinforcement Learning and Dynamic Window Approach for Goal-Directed Local Navigation in Multi-DoF Robots

CoT-Guard: Small Models for Strong Monitoring

Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization

Quantifying Potential Observation Missingness in Inverse Reinforcement Learning

Revisiting DAgger in the Era of LLM-Agents

Reinforced Collaboration in Multi-Agent Flow Networks

A Persistence-Aware Framework for Age Violation Control in Wireless Status Update Systems

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

ERPPO: Entropy Regularization-based Proximal Policy Optimization

Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications

Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

GAGPO: Generalized Advantage Grouped Policy Optimization

An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

Teacher-Guided Policy Optimization for LLM Distillation

Submodular Multi-Agent Policy Learning for Online Distributed Task Allocation in Open Multi-Agent Systems

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

Trajectory-Level Data Augmentation for Offline Reinforcement Learning

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

Sustainable Graph Analytics Workload Scheduling with Evolutionary Reinforcement Learning in Edge-Cloud Systems

MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

Tight Sample Complexity Bounds for Entropic Best Policy Identification

Keyword: diffusion policy

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

CUBic: Coordinated Unified Bimanual Perception and Control Framework
