Generated: 2026-03-24 16:59:30 (UTC+8); arXiv release time: 2026-03-24 20:00 EDT (2026-03-25 08:00 UTC+8)

74 relevant papers today

Keyword: reinforcement learning

Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

快慢思维 RM:标量与生成奖励模型的高效集成

Beyond Scalar Rewards: Distributional Reinforcement Learning with Preordered Objectives for Safe and Reliable Autonomous Driving

超越标量奖励:基于预序目标的值分布强化学习,实现安全可靠的自动驾驶

Emergency Lane-Change Simulation: A Behavioral Guidance Approach for Risky Scenario Generation

紧急变道模拟:一种用于风险场景生成的行为指导方法

Joint Trajectory, RIS, and Computation Offloading Optimization via Decentralized Model-Based PPO in Urban Multi-UAV Mobile Edge Computing

城市多无人机移动边缘计算中的联合轨迹、RIS与计算卸载优化:通过去中心化的基于模型的PPO实现

JCAS-MARL: Joint Communication and Sensing UAV Networks via Resource-Constrained Multi-Agent Reinforcement Learning

JCAS-MARL:通过资源受限多智能体强化学习实现联合通信与感测无人机网络

Learning Communication Between Heterogeneous Agents in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence

多智能体强化学习中异构智能体间通信的学习,用于自主网络防御

MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery

MARLIN:增量DAG发现的多智能体强化学习

Bounded Coupled AI Learning Dynamics in Tri-Hierarchical Drone Swarms

三层级无人机群体中的有界耦合AI学习动态

Leum-VL Technical Report

Leum-VL技术报告

CAMA: Exploring Collusive Adversarial Attacks in c-MARL

CAMA:探索c-MARL中的合谋对抗攻击

SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning

SymCircuit:通过熵正则化强化学习实现可处理概率电路的贝叶斯结构推断

Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

从多源不完美偏好中进行强化学习:两种情形下皆优的遗憾界

Fluid Antenna Networks Beyond Beamforming: An AI-Native Control Paradigm for 6G

超越波束成形的流体天线网络:6G的AI原生控制范式

Grounded Chess Reasoning in Language Models via Master Distillation

通过大师蒸馏在语言模型中实现有依据的国际象棋推理

Delightful Distributed Policy Gradient

令人愉快的分布式策略梯度

Current state of the multi-agent multi-view experimental and digital twin rendezvous (MMEDR-Autonomous) framework

多智能体多视角实验与数字孪生会合(MMEDR-Autonomous)框架的现状

Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models

迈向实用的基于世界模型的视觉-语言-动作模型强化学习

Speedup Patch: Learning a Plug-and-Play Policy to Accelerate Embodied Manipulation

加速补丁:学习即插即用策略以加速具身操控

AI-Driven Multi-Agent Simulation of Stratified Polyamory Systems: A Computational Framework for Optimizing Social Reproductive Efficiency

AI驱动的多智能体分层多元恋系统模拟:优化社会生殖效率的计算框架

Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

多模态大型语言模型用于胃肠道诊断的临床认知对齐

Decoupling Numerical and Structural Parameters: An Empirical Study on Adaptive Genetic Algorithms via Deep Reinforcement Learning for the Large-Scale TSP

数值与结构参数的解耦:一项通过深度强化学习实现自适应遗传算法的实证研究,适用于大规模TSP

RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution

对大型语言模型的RLVR训练并不能提升通用问答的思维能力:评估方法与简单解决方案

EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis

EruDiff:在扩散模型中重构知识以实现高级文本转图像合成

Deep Adaptive Rate Allocation in Volatile Heterogeneous Wireless Networks

波动异构无线网络中的深度自适应速率分配

Cyber Deception for Mission Surveillance via Hypergame-Theoretic Deep Reinforcement Learning

通过超博弈理论深度强化学习实现任务监视的网络欺骗

The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes

智能不服从博弈:在斯塔克伯格博弈和马尔可夫决策过程中表述不服从

OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

OrbitStream:通过语义势场实现无训练自适应360度视频流

DSL-R1: From SQL to DSL for Training Retrieval Agents across Structured and Unstructured Data with Reinforcement Learning

DSL-R1:从SQL到DSL,用于通过强化学习训练跨结构化和非结构化数据的检索代理

Knowledge Boundary Discovery for Large Language Models

大型语言模型的知识边界发现

DRL-driven Online Optimization for Joint Traffic Reshaping and Channel Reconfiguration in RIS-assisted Semantic NOMA Communications

基于DRL驱动的在线优化,用于RIS辅助语义NOMA通信中的联合流量重塑和信道重配置

Learning to Optimize Joint Source and RIS-assisted Channel Encoding for Multi-User Semantic Communication Systems

学习优化多用户语义通信系统的联合源和RIS辅助信道编码

VisFly-Lab: Unified Differentiable Framework for First-Order Reinforcement Learning of Quadrotor Control

VisFly-Lab:一阶强化学习四旋翼控制的统一可微框架

Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition

自主机器人心脏超声标准视图采集的解剖先验驱动框架

Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues

通过带视觉线索的结果奖励强化学习激励生成式零样本学习

Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning

重访大型语言模型的树状搜索:Gumbel和顺序减半法以实现预算可扩展推理

Rethinking Plasticity in Deep Reinforcement Learning

重新思考深度强化学习中的可塑性

Reward Sharpness-Aware Fine-Tuning for Diffusion Models

面向扩散模型的奖励锐度感知微调

Prompt replay: speeding up grpo with on-policy reuse of high-signal prompts

提示回放:通过同策略复用高信号提示加速GRPO

DeepXplain: XAI-Guided Autonomous Defense Against Multi-Stage APT Campaigns

DeepXplain:XAI引导的针对多阶段APT攻击活动的自主防御

RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

RoboAlign:学习视觉-语言-动作模型中用于语言-动作对齐的测试时推理

A transformer architecture alteration to incentivise externalised reasoning

一种激励外显推理的Transformer架构改动

PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

PivotRL:以低计算成本实现高精度的智能体式后训练

Dynasto: Validity-Aware Dynamic-Static Parameter Optimization for Autonomous Driving Testing

Dynasto:自动驾驶测试的有效性感知动态静态参数优化

KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

KG-Hopper:通过强化学习赋能紧凑的开放大型语言模型,实现知识图谱推理

DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

DRTriton:用于Triton内核生成的大规模合成数据强化学习

Learning Can Converge Stably to the Wrong Belief under Latent Reliability

学习可能会在潜在可靠度下稳定地收敛到错误的信念

VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

VIGIL:基于部件定位的结构化推理,用于可泛化的深度伪造检测

What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

世界模型在强化学习中学到了什么?探测学习到的环境模拟器中的潜在表征

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

面向多智能体协作的反事实信用策略优化

Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

多智能体强化学习的自适应稳健估计器

Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

时空注意力增强的多智能体深度强化学习,适用于通信受限的无人机辅助无线网络

Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

路径空间中的近端策略优化:薛定谔桥视角

TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

TAMTRL:长上下文压缩中多回合强化学习的教师对齐奖励重塑

PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma

PPGL-Swarm:嗜铬细胞瘤和副神经节瘤的综合多模态风险分层与遗传综合征检测

EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

EvoIdeator:通过以清单为基础的强化学习演进科学思想

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

CellFluxRL:通过强化学习实现生物约束虚拟细胞建模

Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends

视觉里程计前端的图像调节自适应参数调优

Agentic Personas for Adaptive Scientific Explanations with Knowledge Graphs

基于知识图谱的自适应科学解释的智能体角色

P^2O: Joint Policy and Prompt Optimization

P^2O:联合策略与提示优化

Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors

深度强化学习与两种时序差分误差的故事

Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

揭开长程工具使用智能体的强化学习神秘面纱:一份全面的方案

TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning

TREX:多目标强化学习的轨迹解释

MEVIUS2: Practical Open-Source Quadruped Robot with Sheet Metal Welding and Multimodal Perception

MEVIUS2:具备钣金焊接和多模态感知的实用开源四足机器人

A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP

基于数字孪生MDP改进企业AI代理的上下文工程框架

On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

关于 LLM 推理 RLVR 更新方向:识别与利用

Closed-Loop Verbal Reinforcement Learning for Task-Level Robotic Planning

任务级机器人规划的闭环言语强化学习

Cross-Modal Reinforcement Learning for Navigation with Degraded Depth Measurements

基于降级深度测量的跨模态强化学习导航

Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

看到就是进步:视觉反馈用于迭代文本布局优化

Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control

让追踪变得简单:神经运动重定向用于人形全身控制

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

SpatialReward:文本到图像生成中可验证的空间奖励建模,实现细粒度空间一致性

DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming

DexDrummer:手内、接触丰富且长时程的灵巧机器人击鼓

TiCo: Time-Controllable Training for Spoken Dialogue Models

TiCo:语音对话模型的时间可控训练

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

解耦探索与策略优化:面向困难探索问题的不确定性引导树搜索

Keyword: diffusion policy

Dreaming the Unseen: World Model-regularized Diffusion Policy for Out-of-Distribution Robustness

梦见未见:世界模型正则化的扩散策略,实现分布外鲁棒性