Generated: 2026-04-28 18:18:15 (UTC+8); arXiv release time: 2026-04-28 20:00 EDT (2026-04-29 08:00 UTC+8)

61 relevant papers today

Keyword: reinforcement learning

KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning

Accelerating Reinforcement Learning for Wind Farm Control via Expert Demonstrations

Load constrained wind farm flow control through multi-objective multi-agent reinforcement learning

Hierarchical RL-MPC Control for Dynamic Wake Steering in Wind Farms

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

Risk Models as Mediating Artifacts: A Postphenomenological Analysis of the CIIM Framework in Cybersecurity Practice

When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

StackFeat RL: Reinforcement Learning over Iterative Dual Criterion Feature Selection for Stable Biomarker Discovery

DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

RL Token: Bootstrapping Online RL with Vision-Language-Action Models

UAV Trajectory and Bandwidth Allocation for Efficient Data Collection in Low-Altitude Intelligent IoT: A Hierarchical DRL Approach

Cooperative Informative Sensing for Monitoring Dynamic Indoor Environments via Multi-Agent Reinforcement Learning

CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

GIFT: Global stabilisation via Intrinsic Fine Tuning

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue

Learning from Demonstration with Failure Awareness for Safe Robot Navigation

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making

CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

DRL-Based Antenna Position Optimization For MA-Assisted OTFS System Under Imperfect CSI

GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

QuietWalk: Physics-Informed Reinforcement Learning for Ground Reaction Force-Aware Humanoid Locomotion Under Diverse Footwear

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

Unleashing the Agility of Wheeled-Legged Robots for High-Dynamic Reflexive Obstacle Evasion

Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

MUSIC: Learning Muscle-Driven Dexterous Hand Control

Hindsight Preference Optimization for Financial Time Series Advisory

EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer

AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning

An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

Leveraging Human Feedback for Semantically-Relevant Skill Discovery

POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

Model-Free Inference of Investor Preferences: A Relative Entropy IRL Approach

DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

Perfecting Aircraft Maneuvers with Reinforcement Learning

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

An Aircraft Upset Recovery System with Reinforcement Learning

Comparative Evaluation of Modern Deep Learning Methodologies for Portfolio Optimization

TARMM: Scaling Delay-Critical Edge AI Offloading in 5G O-RAN via Temporal Graph Mobility Management

DECOFFEE: Decentralized Reinforcement Learning for Time-critical Workload Offloading and Energy Efficiency across the Computing Continuum

A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

Hierarchical Behaviour Spaces

Improving Vision-language Models with Perception-centric Process Reward Models

Agent-Centric Visual Reinforcement Learning under Dynamic Perturbations

SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Keyword: diffusion policy

Tube Diffusion Policy: Reactive Visual-Tactile Policy Learning for Contact-rich Manipulation
