Generated: 2026-03-03 16:47:43 (UTC+8); arXiv release time: 2026-03-03 20:00 EST (2026-03-04 09:00 UTC+8)

95 relevant articles today

Keyword: reinforcement learning

Reinforcement Learning for Control with Probabilistic Stability Guarantee: A Finite-Sample Approach

Breaking the Factorization Barrier in Diffusion Language Models

Safe Multi-Agent Deep Reinforcement Learning for Privacy-Aware Edge-Device Collaborative DNN Inference

Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion

FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

Bridging Policy and Real-World Dynamics: LLM-Augmented Rebalancing for Shared Micromobility Systems

RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration

VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning

Conservative Equilibrium Discovery in Offline Game-Theoretic Multiagent Reinforcement Learning

Hereditary Geometric Meta-RL: Nonlocal Generalization via Task Symmetries

HydroShear: Hydroelastic Shear Simulation for Tactile Sim-to-Real Reinforcement Learning

Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems

Optimal-Horizon Social Robot Navigation in Heterogeneous Crowds

Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks

Learning to Attack: A Bandit Approach to Adversarial Context Poisoning

Learning to Explore: Policy-Guided Outlier Synthesis for Graph Out-of-Distribution Detection

From Simulation to Reality: Practical Deep Reinforcement Learning-based Link Adaptation for Cellular Networks

Frozen Policy Iteration: Computationally Efficient RL under Linear $Q^π$ Realizability for Deterministic Dynamics

Keyframe-Guided Structured Rewards for Reinforcement Learning in Long-Horizon Laboratory Robotics

RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

Qwen3-Coder-Next Technical Report

MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning

Minimalist Compliance Control

HierKick: Hierarchical Reinforcement Learning for Vision-Guided Soccer Robot Control

Stabilizing Policy Optimization via Logits Convexity

Intent-Context Synergy Reinforcement Learning for Autonomous UAV Decision-Making in Air Combat

HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures

How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling

PARWiS: Winner determination under shoestring budgets using active pairwise comparisons

Reasoning Boosts Opinion Alignment in LLMs

Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

Learn Hard Problems During RL with Reference Guided Fine-tuning

Can Thinking Models Think to Detect Hateful Memes?

Towards Policy-Adaptive Image Guardrail: Benchmark and Method

MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

Beyond Reward: A Bounded Measure of Agent Environment Coupling

Integrating LTL Constraints into PPO for Safe Reinforcement Learning

Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space

Energy Efficient Traffic Scheduling For Optical LEO Satellite Downlinks

SubstratumGraphEnv: Reinforcement Learning Environment (RLE) for Modeling System Attack Paths

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning

ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning

Towards Robot Skill Learning and Adaptation with Gaussian Processes

Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents

LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

State-Action Inpainting Diffuser for Continuous Control with Delay

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents

Learning Thermal-Aware Locomotion Policies for an Electrically-Actuated Quadruped Robot

Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training

Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport

FireRed-OCR Technical Report

SEAR: Sample Efficient Action Chunking Reinforcement Learning

Generative Visual Chain-of-Thought for Image Editing

Visual Bias in Simulated Users: The Impact of Luminance and Contrast on Reinforcement Learning-based Interaction

Efficient RLVR Training via Weighted Mutual Information Data Selection

LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards

Expanding LLM Agent Boundaries with Strategy-Guided Exploration

Accelerating PDE Surrogates via RL-Guided Mesh Optimization

$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

Reinforcement Learning-Based Filters for Convection-Dominated Flows: Reference-Free and Reference-Guided Training

Learning from Synthetic Data Improves Multi-hop Reasoning

ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

Near-Optimal Regret for KL-Regularized Multi-Armed Bandits

Tool Verification for Test-Time Reinforcement Learning

Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training

Keyword: diffusion policy

Mean-Flow based One-Step Vision-Language-Action

Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy
