Generated: 2026-05-18 19:46:33 (UTC+8); arXiv release time: 2026-05-18 20:00 EDT (2026-05-19 08:00 UTC+8)

40 related articles today

Keyword: reinforcement learning

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Training on Documents About Monitoring Leads to CoT Obfuscation

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

Controllable Molecular Generative Foundation Models

Video Models Can Reason with Verifiable Rewards

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays

Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy

DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking

Rethinking Neural Network Learning Rates: A Stackelberg Perspective

NavRL++: A System-Level Framework for Improving Sim-to-Real Transfer in Reinforcement Learning-Based Robot Navigation

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Calibrating LLMs with Semantic-level Reward

Offline Reinforcement Learning with Universal Horizon Models

Sharp Spectral Thresholds for Logit Fixed Points

PCASim: Promptable Closed-loop Adversarial Simulation for Urban Traffic Environment

Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Distributed Zeroth-Order Policy Gradient for Networked Multi-agent Reinforcement Learning from Human Feedback

Scale: Deep Reinforcement Learning for Container Scheduling in Serverless Edge Computing

Learning Dynamic Pick-and-Place for a Legged Manipulator

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

ALSO: Adversarial Online Strategy Optimization for Social Agents

Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics

Embedding-perturbed Exploration Preference Optimization for Flow Models

Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education

A Multi-Layer Cloud-IDS Pipeline with LLM and Adaptive Q-Learning Calibration

Dynamic Plasma Shape Control with Arbitrary Sensor Subsets

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

Imperfect World Models are Exploitable

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

Mind Dreamer: Untethering Imagination via Active Latent Intervention on Latent Manifolds

Look Before You Leap: Autonomous Exploration for LLM Agents

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

Argus: Evidence Assembly for Scalable Deep Research Agents

Keyword: diffusion policy

No results.