Generated: 2026-04-07 17:08:04 (UTC+8); arXiv release time: 2026-04-07 20:00 EDT (2026-04-08 08:00 UTC+8)

56 relevant papers today

Keyword: reinforcement learning

Self-Execution Simulation Improves Coding Models

SDVDiag: Using Context-Aware Causality Mining for the Diagnosis of Connected Vehicle Functions

Hypernetwork-Conditioned Reinforcement Learning for Robust Control of Fixed-Wing Aircraft under Actuator Failures

Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions

Improving Feasibility via Fast Autoencoder-Based Projections

Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

BioAlchemy: Distilling Biological Literature into Reasoning-Ready Reinforcement Learning Training Data

Optimizing Neurorobot Policy under Limited Demonstration Data through Preference Regret

Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling

HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving

Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback

User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training

RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin

Decomposing Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

Provable Multi-Task Reinforcement Learning: A Representation Learning Framework with Low Rank Rewards

Can LLMs Learn to Reason Robustly under Noisy Supervision?

VA-FastNavi-MARL: Real-Time Robot Control with Multimedia-Driven Meta-Reinforcement Learning

Multi-AUV Trajectory Learning for Sustainable Underwater IoT with Acoustic Energy Transfer

Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization

Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It

Learning Dexterous Grasping from Sparse Taxonomy Guidance

DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

Learning from Imperfect Demonstrations via Temporal Behavior Tree-Guided Trajectory Repair

Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

MC-CPO: Mastery-Conditioned Constrained Policy Optimization

APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs

Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games

ReinVBC: A Model-based Reinforcement Learning Approach to Vehicle Braking Controller

Structured Causal Video Reasoning via Multi-Objective Alignment

Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning

DeonticBench: A Benchmark for Reasoning over Rules

Retrieval Augmented Conversational Recommendation with Reinforcement Learning

One Model for All: Multi-Objective Controllable Language Models

Memory Intelligence Agent

DRL-Based Phase Optimization for O-RIS in Dual-Hop Hard-Switching FSO/RIS-aided RF and UWOC Systems

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Paper Espresso: From Paper Overload to Research Insight

Digital Privacy in IoT: Exploring Challenges, Approaches and Open Issues

Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions

Discovering Failure Modes in Vision-Language Models using RL

Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Selecting Decision-Relevant Concepts in Reinforcement Learning

Synthetic Sandbox for Training Machine Learning Engineering Agents

Data Attribution in Adaptive Learning

Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

Analyzing Symbolic Properties for DRL Agents in Systems and Networking

Vero: An Open RL Recipe for General Visual Reasoning

Stratifying Reinforcement Learning with Signal Temporal Logic

Keyword: diffusion policy

Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking

HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving
