Generated: 2026-05-07 18:26:47 (UTC+8); arXiv release time: 2026-05-07 20:00 EDT (2026-05-08 08:00 UTC+8)

40 relevant papers today

Keyword: reinforcement learning

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

Designing a double deep reinforcement learning selection tool for resilient demand prediction

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing

Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies

Explaining and Preventing Alignment Collapse in Iterative RLHF

Efficiently Aligning Language Models with Online Natural Language Feedback

Extending Differential Temporal Difference Methods for Episodic Problems

Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV

Queue-Aware and Resilient Routing in LEO Satellite Networks Using Multi-Agent Reinforcement Learning

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models

Delay-Aware Large-Small Model Collaboration over LEO Satellite Networks

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

From Reach to Insert: Tactile-Augmented Precision Assembly under Sub-Millimeter Tolerances

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning

Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Hierarchical Multiagent Reinforcement Learning for Multi-Group Tax Game

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

A Hierarchical Agent System with Reinforcement Learning for Multivariate Time Series Data Cleaning

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

Modular Reinforcement Learning For Cooperative Swarms

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Keyword: diffusion policy

No results