Generated: 2026-05-27 19:41:31 (UTC+8); arXiv release time: 2026-05-27 20:00 EDT (2026-05-28 08:00 UTC+8)

59 relevant papers today

Keyword: reinforcement learning

ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

Unified Neural Scaling Laws

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

Balancing Plasticity and Stability with Fast and Slow Successor Features

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

Heterogeneous AAV Logistics Task Allocation: A Reinforcement Learning Enhanced Overlapping Coalition Formation Game Approach

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

Breaking the Epistemic Trap: Active Perception Under Compound Uncertainty

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

KARMA: Karma-Aligned Reward Model Adaptation

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

Ratio-Variance Regularized Policy Optimization

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

Learning to Adapt SFT Data for Better Reasoning Generalization

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

Trust, Geometry, and Rules: A Credibility-Aware Reinforcement Learning Framework for Safe USV Navigation under Uncertainty

Probabilistic Recurrent Intention Switching Model

SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures

Learning to Balance Motor Thermal Safety and Quadrupedal Locomotion Performance with Residual Policy

Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search

Trust Region Q Adjoint Matching

MuChator: Enabling Active Music Discovery via Conversational Music LLMs in Douyin Music

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

Container Unloading via Reinforcement Learning: Picking Order, Deadlock Avoidance, and Proof-of-Concept Simulation

Touch-R1: Reinforcing Touch Reasoning in MLLMs

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Keyword: diffusion policy

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

Riding the Shifting Potential: When Reactive Control Suffices for Multi-Goal Behavior
