Generated: 2026-05-12 18:39:26 (UTC+8); arXiv release time: 2026-05-12 20:00 EDT (2026-05-13 08:00 UTC+8)

122 related papers today

Keyword: reinforcement learning

Reinforcement learning for inverse structural design and rapid laser cutting of kirigami prototypes

Distributional Reinforcement Learning via the Cramér Distance

Interactive Inverse Reinforcement Learning of Interaction Scenarios via Bi-level Optimization

Quantile Geometry Regularization for Distributional Reinforcement Learning

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

Insider Attacks in Multi-Agent LLM Consensus Systems

Anatomical Landmark-Guided Deep Reinforcement Learning for Autonomous Gastric Navigation

LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

AIPO: Learning to Reason from Active Interaction

Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service

Technical Report: A Hierarchical Dynamically Weighting Deep Reinforcement Learning Method for Multi-UAV Multi-Task Coordination

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents

Generative Actor-Critic with Soft Bridge Policies

Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

Internalizing Safety Understanding in Large Reasoning Models via Verification

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

ParityFuzz: Finding Inconsistencies across Solidity Compilers via Fine-Grained Mutation and Differential Analysis

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

Data-Driven Inverse Reinforcement Learning of Linear Systems with Model Uncertainty: A Convex Optimization View

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

Learning the Preferences of a Learning Agent

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

Reinforcing Multimodal Reasoning Against Visual Degradation

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

Functional Graphs for Predicting and Explaining Goal Failure in Sparse Goal-Conditioned RL

Skill-R1: Agent Skill Evolution via Reinforcement Learning

Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

Adaptive Data Harvesting for Efficient Neural Network Learning with Universal Constraints

On-Policy Distillation with Best-of-N Teacher Rollout Selection

One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

Operationalizing Cybersecurity Governance for Mitigation Planning with Attack-Path Modeling and Reinforcement Learning

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

Learning to Compress Time-to-Control: A Reinforcement Learning Framework for Chronic Disease Management

Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy

Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

Adaptive Action Chunking via Multi-Chunk Q Value Estimation

EFGCL: Learning Dynamic Motion through Spotting-Inspired External Force Guided Curriculum Learning

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Is DRL-based MAC Ready for Underwater Acoustic Networks? Exploring Its Practicality in Real Field Experiments

Unsupervised Process Reward Models

Balancing Efficiency and Fairness in Traffic Light Control through Deep Reinforcement Learning

MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

Relative Score Policy Optimization for Diffusion Language Models

When Does Non-Uniform Replay Matter in Reinforcement Learning?

Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

Verifiable Process Rewards for Agentic Reasoning

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation

Causal Explanations from the Geometric Properties of ReLU Neural Networks

HiRL: Hierarchical Reinforcement Learning for Coordinated Resource Management in Heterogeneous Edge Computing

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning

DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning

PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay

Controllability in preference-conditioned multi-objective reinforcement learning

Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

Policy Gradient Methods for Non-Markovian Reinforcement Learning

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Keyword: diffusion policy

HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
