Generated: 2026-04-17 17:21:19 (UTC+8); arXiv release time: 2026-04-17 20:00 EDT (2026-04-18 08:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Reinforcement Learning via Value Gradient Flow

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach

When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse

Step-level Denoising-time Diffusion Alignment with Multiple Objectives

On Tackling Complex Tasks with Reward Machines and Signal Temporal Logics

Improving Human Performance with Value-Aware Interventions: A Case Study in Chess

Scouting By Reward: VLM-TO-IRL-Driven Player Selection For Esports

Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve

MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion

Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Mean Flow Policy Optimization

The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

RELOAD: A Robust and Efficient Learned Query Optimizer for Database Systems

Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

Learning Ad Hoc Network Dynamics via Graph-Structured World Models

SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling

Switch: Learning Agile Skills Switching for Humanoid Robots

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

RL-STPA: Adapting System-Theoretic Hazard Analysis for Safety-Critical Reinforcement Learning

Abstract Sim2Real through Approximate Information States

Generalization in LLM Problem Solving: The Case of the Shortest Path

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Keyword: diffusion policy

No results