Generated: 2026-05-19 19:23:55 (UTC+8); arXiv release time: 2026-05-19 20:00 EDT (2026-05-20 08:00 UTC+8)

80 related papers today

Keyword: reinforcement learning

Mirror Descent-Type Algorithms for the Variational Inequality Problem with Functional Constraints

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

QuantFPFlow: Quantum Amplitude Estimation for Fokker-Planck Policy Optimisation in Continuous Reinforcement Learning

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

Identifiable Token Correspondence for World Models

REC-RL: Referring expression counting via Gaussian and range-based reward optimization

World Model-Enabled Causal Digital Twins for Semantic Communications in Physical AI Systems

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

AoI-MDP: An AoI Optimized Markov Decision Process (Student Abstract)

The Unlearnability Phenomenon in RLVR for Language Models

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning

A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation

Multi-LLM Systems Exhibit Robust Semantic Collapse

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

Step-wise Rubric Rewards for LLM Reasoning

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

Learning Fill-in Reduction Ordering via Graph Policy Optimization for Sparse Matrices

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

Self-supervised Hierarchical Visual Reasoning with World Model

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

An Efficient Streaming Video Understanding Framework with Agentic Control

Transfer Learning for Customized Car Racing Environments

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

Generation Navigator: A State-Aware Agentic Framework for Image Generation

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

Privacy Preserving Reinforcement Learning with One-Sided Feedback

Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

Alignment Dynamics in LLM Fine-Tuning

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

Heterogeneous Tasks Offloading in Vehicular Edge Computing: A Federated Meta Deep Reinforcement Learning Approach

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Leveraging Latent Visual Reasoning in Silence

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

General Preference Reinforcement Learning

Keyword: diffusion policy

SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization
