Generated: 2026-05-21 19:30:44 (UTC+8); arXiv release time: 2026-05-21 20:00 EDT (2026-05-22 08:00 UTC+8)

63 relevant papers today

Keyword: reinforcement learning

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

Spectral Souping: A Unified Framework for Online Preference Alignment

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

Reinforcing Human Behavior Simulation via Verbal Feedback

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Distributed Direct Preference Optimization

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Finite-Time Regret Analysis of Retry-Aware Bandits

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

CIG: Exploration via Conditional Information Gain

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

Reinforcement Learning-based Control via Y-wise Affine Neural Networks: Comparative Case Studies for Chemical Processes

Behavior-Consistent Deep Reinforcement Learning

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Reinforcement Learning for Risk Adaptation via Differentiable CVaR Barrier Functions

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer

Validating Navmesh using Geometry: Voxel-Based Analysis with Prioritized Exploration

roto 2.0: The Robot Tactile Olympiad

Mem-$π$: Adaptive Memory through Learning When and What to Generate

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Keyword: diffusion policy

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation
