Generated: 2026-06-23 19:22:31 (UTC+8); arXiv release time: 2026-06-23 20:00 EDT (2026-06-24 08:00 UTC+8)

93 related papers today

Keyword: reinforcement learning

RL-based Joint Coverage and Beam Optimization of High Altitude Platform Systems

Darwin Mobile Agent: A Roadmap for Self-Evolution

An LLM-Explainable DRL Framework for Passenger-Directed Autonomous Driving

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

Platooning Connected, Autonomous, and Human-Driven Vehicles: A Deep Reinforcement Learning-based Approach

SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model

BARD-MARL: Byzantine-Agent Detection for Learned Communication in Multi-Agent Reinforcement Learning

MotionPyramid: Hierarchical Motion Representation and Residual Interfaces

Empowering Economic Simulation Through Situation-Aware LLM-Driven Generative System

Provably Sub-Linear Two-Timescale NeuroEvolution with Online Plasticity

Evolutionary Discovery of Developmental Reward Schedules in Deep Reinforcement Learning

When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study

Learning-Based List Sequential Belief Propagation Decoding of Quantum LDPC Codes

Heterogeneous Policy Networks for Composite Robot Team Communication and Coordination

Formalizing Task-Space Complexity for Zero-Shot Generalization

CogniRoute: Learning to Route Social Evidence in Omni-Modal Models

Sim2O: Efficient Offline-to-Online MARL via Joint Action Composition

Horizon Adaptive Offline Policy Learning via Value Stitching

Pose-Agnostic Robotic Functional Grasping via Observation-Action Canonicalization

Inverting the Bellman Equation: From $Q$-Values to World Models

Sakana Fugu Technical Report

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

A Test-time Actor-Critic Approach to News Images Generation

Objective-Behavior Alignment: Diagnostics for MORL Policy Selection

A Reward-Petri-Net Interpretation of Temporal Behavior Trees

Long-Distance Real-World Navigation of the Legged-Wheeled Robot Go2-W Using Deep Reinforcement Learning

Federated Temporal Attention Intelligence for Cyber-Resilient IoMT: Lightweight Digital Twins and PPO-Driven Honeypot Deception

Precision Recall Controllable Radiology Report Generation via Hybrid Natural Language and Clinical Reward Learning

Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training

Backpropagating Through Simulation: Analytic Policy Gradients for Sample and Learning Efficient Differentiable Continuous Control

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning for Autonomous Driving

The Two-Hump Problem: Bridging the Difficulty Gap in Mathematical Reinforcement Learning

Motion-Aware Reinforcement Learning For Object Localization

CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

KineticSim: A Lightweight, High-Performance Execution Engine for Real-Time Market Simulators

Discretizing Reward Models

Mat-Pref: Verifiable-Reward Training Improves Compositional Reasoning in Inorganic Materials

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

IRumAI: Reinforcement Learning for Indian Rummy

RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation

Deep RL-Tuned Model-Free Adaptive Control for Lower-Limb Exoskeletons During Sit-to-Stand Transitions

When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR

Reinforcement Learning-Based Traffic Signal Control for IoT-Enabled Intersections

Zero-shot Transfer of Reinforcement Learning Control Policies for the Swing-Up and Stabilization of a Cart-Pole System

Meta-Reinforcement Learning via Evolution for Multi-Objective Combinatorial Supply Chain Optimisation

L20-Edu-135M: An Auditable Single-GPU Study of Data-Efficient Small Language Modeling

FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation

Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning

Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

Curvature-Adaptive Consistency Flow Matching: Autonomous Trajectory Optimization via Reinforcement Learning

SVGym (SciVerseGym): An Environment for Reinforcement Learning and Bayesian Optimization in Crystal Discovery

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

Distribution-Aware Robust Bilevel Optimization: Quantile-Guided Huber Updates in Two-Timescale Stochastic Approximation

A Differentiable Atari VCS: A Complex, Fully Known Ground Truth for Explainable AI

Scalable Multi-Task Data Generation via Reinforcement Learning for Language-Conditioned Bimanual Dexterous Manipulation

WebCQ: Cooperative Multi-Agent Deep Reinforcement Learning for Scalable Web GUI Testing

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

What are Key Factors for Updates in RL for LLM Reasoning?

Stationary Robust Mean-Field Games under Model Mismatches

On the Position Bias of On-Policy Distillation

Scalable Maximum Entropy Reinforcement Learning for Diffusion Policies via Adjoint Matching

A Markov Chain Approach to Preference Alignment

GeoRouteNet: Geometry-Enhanced Non-Autoregressive Neural Solver for the Traveling Salesman Problem

Policy-as-Data: Learning Generalizable HOI Diffusion Models from Simulated Physics

Active Inference as the Test-Time Scaling Law for Physical AI Agents

HiL-ResRL: A Model-Agnostic Finetuning Adapter via Human-in-the-loop Residual Reinforcement Learning

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

Hierarchical Reinforcement Learning for Sparse-Reward Search in Commutative Algebra

EchoFlow: A Workload-Aware Parameter Tuning Method for Blockchain Systems

Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Asymmetric physics enables efficient learning in quadrupedal robot swarms

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

Dynamic multi-agent deep reinforcement learning-based pricing and incentivization approach in multimodal transportation networks

BoxCtrl: 3D-Aware Visual Prompting for Geometric Image Editing

Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation

SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Decentralized Autonomous Traffic Management through Corridor Networks

SPIRAL: Learning to Search and Aggregate

dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models

Learning Process Rewards via Success Visitation Matching for Efficient RL

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

Keyword: diffusion policy

BayesFP: Posterior Estimation for Flow-Based Policies via Feynman-Kac Sampling

Factor-Aware Mixture-of-Experts with Pretrained Encoder for Combinatorial Generalization

Temporal Logic Guidance for Action-Only Diffusion Policies with World Models