Generated: 2026-05-20 18:53:03 (UTC+8); arXiv release time: 2026-05-20 20:00 EDT (2026-05-21 08:00 UTC+8)

48 related papers today

Keyword: reinforcement learning

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Composition of Memory Experts for Diffusion World Models

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

The fitness landscape of social norms in social dilemmas

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

Exact Linear Attention

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Emergence of a Flow-Assisted Casting Strategy for Olfactory Navigation via Memory-Augmented Reinforcement Learning

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

TabQL: In-Context Q-Learning with Tabular Foundation Models

Prompt Optimization for LLM Code Generation via Reinforcement Learning

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

UAV-Assisted Cooperative Edge Inference for Low-Altitude Economy via MoE-based Hierarchical Deep Reinforcement Learning

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Generative Auto-Bidding with Unified Modeling and Exploration

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Sampling-Based Safe Reinforcement Learning

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Implicit Action Chunking for Smooth Continuous Control

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

Memory-Augmented Reinforcement Learning Agent for CAD Generation

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Fair-Aurora: Comparing Fairness Strategies for Reinforcement Learning-Based Congestion Control in Multi-Flow Environments

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

JAXenstein: Accelerated Benchmarking for First-Person Environments

Safe Deep Reinforcement Learning for Spacecraft Reorientation with Pointing Keep-Out Constraint

A conceptual framework for learning to listen by reward: Curiosity-driven search for novel sources

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Keyword: diffusion policy

No results.