Generated: 2026-03-04 16:42:50 (UTC+8); arXiv release time: 2026-03-04 20:00 EST (2026-03-05 09:00 UTC+8)

44 relevant papers today

Keyword: reinforcement learning

ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning

COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management

TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

Safe Whole-Body Loco-Manipulation via Combined Model and Learning-based Control

RIS-Enabled Wireless Channel Equalization: Adaptive RIS Equalizer and Deep Reinforcement Learning

Wasserstein Proximal Policy Gradient

Towards Parameter-Free Temporal Difference Learning

Heterogeneous Agent Collaborative Reinforcement Learning

Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving

Post Hoc Extraction of Pareto Fronts for Continuous Control

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization

Watch Your Step: Learning Semantically-Guided Locomotion in Cluttered Environment

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

Enhancing User Throughput in Multi-panel mmWave Radio Access Networks for Beam-based MU-MIMO Using a DRL Method

Next Embedding Prediction Makes World Models Stronger

VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning

Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling

Rhythm: Learning Interactive Whole-Body Control for Dual Humanoids

Learning in Markov Decision Processes with Exogenous Dynamics

SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

On the Structural Limitations of Weight-Based Neural Adaptation and the Role of Reversible Behavioral Learning

Contextual Latent World Models for Offline Meta Reinforcement Learning

Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

DreamFlow: Local Navigation Beyond Observation via Conditional Flow Matching in the Latent Space

Contextualized Privacy Defense for LLM Agents

Why Does RLAIF Work At All?

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

CMoE: Contrastive Mixture of Experts for Motion Control and Terrain Adaptation of Humanoid Robots

Reinforcement Learning with Symbolic Reward Machines

TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Proactive Guiding Strategy for Item-side Fairness in Interactive Recommendation

Deep Q-Learning-Based Gain Scheduling for Nonlinear Quadcopter Dynamics

RL-Based Coverage Path Planning for Deformable Objects on 3D Surfaces

Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Specificity-aware reinforcement learning for fine-grained open-world classification

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Keyword: diffusion policy

There are no results.