Generated: 2026-06-02 20:10:25 (UTC+8); arXiv release time: 2026-06-02 20:00 EDT (2026-06-03 08:00 UTC+8)

102 related papers today

Keyword: reinforcement learning

SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant

SortingHat:用定制数字教学助理重新定义操作系统教育

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

MindGames 竞技场泛化专题:延迟每步奖励归因的 In2AI 解决方案

Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

机电一体化系统参数识别实验设计中的强化学习

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

从演示到奖励:VLM奖励模型的测试时间提示优化

World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

世界模型:架构、方法论、推理范式及应用的全面综述

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

关于代理工具调用和强化学习培训的有效性和效率

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

通过重试探索政策梯度强化学习的兴起

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

CAST:非特权剪辑非对称自学,带有GRPO优势翻转

Agentic Transformers Provably Learn to Search via Reinforcement Learning

代理变换器可通过强化学习学习搜索

LithoGRPO: Fast Inverse Lithography via GRPO Reinforced Flow Matching

LithoGRPO:通过GRPO增强流匹配实现的快速逆向光刻

MindZero: Learning Online Mental Reasoning With Zero Annotations

MindZero:零注释在线学习思维推理

Capability Self-Assessment: Teaching LLMs to Know Their Limits

能力自我评估:教大语言模型认识自己的极限

HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads

HOIST:模拟和样品高效调校的人形优化,用于操控悬挂载荷

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

ARCA:当令牌信号退化时的适配器残余信用分配

Closed-Loop Neural Activation Control in Vision-Language-Action Models

视觉-语言-行动模型中的闭环神经激活控制

Robust Shielding for Safe Reinforcement Learning

安全强化学习的强健屏蔽

DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties

基于DRL的双阿克曼机器人在驱动不确定性下的姿态控制

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

隔离LLM词汇偏见:一种无需审核的三角测量指标用于偏好阶段学习

Drift Q-Learning

漂移Q-学习

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

长期决策问题中的配对偏好强化学习

Constrained Whole-Body Tracking for Humanoid Robots

人形机器人的受限全身追踪

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

通过受限策略优化进行检测器-规避式大型语言模型的改写

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

PR2:基于MoE的大型语言模型强化学习的预测路由重放

Topology-Aware State Abstraction with Tangle Cores for Markov Decision Processes

具有纠结核心的拓扑感知状态抽象,用于马尔可夫决策过程

SDR: Set-Distance Rewards for Radiology Report Generation

SDR:放射报告生成的集合距离奖励

DriveAnchor: Progressive Anchor-based Flow Learning for Autonomous Driving Planning

DriveAnchor:基于Anchor的渐进式流程学习,用于自动驾驶规划

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

学习检索:文本转SQL代理的双层长期记忆

Interpretable Policy Distillation for Power Grid Topology Control

电网拓扑控制的可解释策略提炼

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

DeepLatent:通过并行潜在视觉推理用图像思考

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER:逐步的同伴优势,多题解答的多样性意识探索奖励

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

CARE-RL:能力感知强化学习以缓解跨域冲突

Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

类人生物感知运动中地形编码的全局-局部注意力分解

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

结果优化的悖论:大型语言模型推理捷径的因果信息论界限

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

正则化离线策略优化与后验混合贝叶斯信念

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

塑造你的身体:多体型机器人设计的价值梯度

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

MOSAIC:结构化智能智能与合成的模块化编排

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

内化温度:策略自蒸馏作为强化学习策略加热器

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

GIRL-DETR:视频时刻检索的梯度孤立强化学习

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

基于Transformer的世界模型进行离线元强化学习的行为不变任务表示学习

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

基于规范,强化学习中可扩展归纳推广的解耦行为克隆

Certificate-Guided Evaluation of Reinforcement Learning Generalization

证书引导的强化学习泛化评估

Meta-Black-Box Optimization with Ensemble Surrogate Modeling for Robustness-Accuracy Trade-off within SAEA

在SAEA中结合集合代理建模的元黑盒优化,实现鲁棒性与准确性的权衡

Enhancing LLM Metacognition via Cognitive Pairwise Training

通过认知成对训练提升LLM元认知

Task diversity produces systematic transfer but inhibits continual reinforcement learning

任务多样性产生系统迁移,但抑制持续强化学习

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Ryze:生物医学论文中的证据丰富数据综合

Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance

通过扩散建模与多智能体强化学习指导实现生成式多机器人运动规划

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

可解释的深度强化学习揭示了节能的湍流阻力减缓控制策略

MedGym: A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning

MedGym:动态医疗治疗强化学习的统一连续时间基准

OPD+: Rethinking the Advantage Design for On-Policy Distillation

OPD+:重新思考政策提炼的优势设计

ExpWeaver: LLM Agents Learn from Experience via Latent RAG

ExpWeaver:LLM代理通过潜在RAG从经验中学习

Interaction-Limited Safe Continuous-Time RL for Dynamical Medical Treatment

限制交互安全连续时间强化学习用于动态医学治疗

Before the Model Learns the Bug: Fuzzing RLVR Verifiers

在模型学习漏洞之前:RLVR验证器在模糊中

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

MViewRouter:通过多视角交替注意力内化几何等变性,实现组合路由

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

CAREAgent:具结构化推理和工具集成的临床代理,用于订单生成

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

MiCU:基于大型语言模型实现端到端智能家居指令理解

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

从无奖励表征到偏好:重新思考基于偏好的离线强化学习

Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

拉格朗日扰动扩散引导:生成策略的潜在强化学习

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

形式数学验证中生成奖励建模的期望值对齐

Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts

通过专家混合灵活安排动态云工作流程,灵活安排不同截止日期

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

通过强化学习和快速采样微调分子生成的扩散模型

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

无无效样本的RLVR:针对LLM推理的组优先级非策略优化

Digital Twin-Assisted Adaptive Multi-Agent DRL for Intelligent Spectrum and Resource Management in Open-RAN UAV-Enabled 6G Networks

数字孪生辅助自适应多智能体DRL,用于Open-RAN无人机支持的6G网络中的智能频谱和资源管理

S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot

S2M-Trek:通过轮腿机器人的每帧深度套装,从单球到多球体传输

All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning

所有模型都是错误的,知道哪里有用:关于强化学习中的模型不确定性

Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models

跨语言自洽性用于多语言推理与语言模型

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

OmniOPD:通过推测验证进行无Logit的政策提炼

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Crazyflow:一款在JAX中实现的精确、GPU加速、可微分的无人机模拟器

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

分层语义增强导航:视觉语言导航的最佳传输与图驱动推理

Physics-Informed Modeling and Control of Emergent Behaviors in Robot Swarms

基于物理的建模与机器人群体中涌现行为的控制

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

TRON:针对视觉推理的可规则验证在线环境

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

ReSkill:在智能强化学习中调和技能创建与策略优化

RDA: Reward Design Agent for Reinforcement Learning

RDA:强化学习奖励设计代理

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign:迈向个性化大语言模型对齐中的普遍真理一致性

MetaForge: A Self-Evolving Multimodal Agent that Retrieves, Adapts, and Forges Tools On Demand

MetaForge:一款自我进化的多模态智能体,能够按需检索、适应并锻造工具

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

CAPF:以信用减弱特权反馈指导搜索代理推广

From Global Policies to Local Strategies: Multi-Objective Optimization of Resource-Specific Handover Policies

从全球政策到本地策略:资源特定切换政策的多目标优化

Task-Induced Representational Invariances Depend on Learning Objective in Deep RL

任务诱导的表征不变性依赖于深度强化学习中的学习目标

Comparing ML-Specific and General Python Code Smells Across Project Characteristics

比较机器学习专用和通用Python代码在不同项目特性上的表现

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

社区意识评估社会文本参与与共鸣:以人为本的视角下用户生成内容评估

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

HMPO:用于思维链压缩的混合中位数长度策略优化

Randomized Least Squares Value Iteration itself is Joint Differentially Private

随机最小二乘值迭代本身是联合微分私有的

AI-Based KPI Prediction Methods in Future 6G Networks: A Survey

基于人工智能的未来6G网络KPI预测方法:一项调查

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

MT-EditFlow:多回合图像编辑的强化学习,支持流程匹配

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

SafeMCP:通过环境基础的前瞻性推理,主动监管LLM代理防御

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

RL-ACRGNet:基于强化学习的胸部放射报告生成网络

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

可解释的数据驱动深度强化学习方法,用于建筑物中最优的能源管理

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

网络分布式多智能体强化学习用于四旋翼共识控制

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

学习何时不采取行动:减轻智能强化学习中工具滥用

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

具学习奖励的大型行为模型的连贯非策略改进

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

通过落后者感知组大小实现更快的同步策略强化学习

ResMerge: Residual-based Spectral Merging of Large Language Models

ResMerge:基于残差的大型语言模型谱合并

Dynamics Are Learned, Not Told: Semi-Supervised Discovery of Latent Dynamics Geometries For Zero-Shot Policy Adaptation

动力学是学习的,而非被告知:半监督发现潜在动力学几何以实现零射策略适应

Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO

通过专家引导GRPO实现精确意图对齐VLA航天导航

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

受限多智能体强化学习的协调图

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

SIRI:基于内在技能的自我内化强化学习,用于LLM代理培训

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Harness-1:具备状态外化束带的搜索代理的强化学习

Policy and World Modeling Co-Training for Language Agents

语言代理的政策与世界建模共同培训

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

多域强化学习中跨域干涉与恢复的局部微扰理论

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

像鸽子一样主动探索:通过能动视觉语言模型强化空间推理

Learning When to Translate for Multilingual Reasoning

学习何时进行多语言推理翻译

Keyword: diffusion policy

From Noise to Control: Parameterized Diffusion Policies

从噪声到控制:参数化扩散政策

Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections

集合监督扩散策略:通过修正学习动作分块扩散