Generated: 2026-05-25 19:54:58 (UTC+8); arXiv release time: 2026-05-25 20:00 EDT (2026-05-26 08:00 UTC+8)

34 related papers today

Keyword: reinforcement learning

FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic

SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control

PIMbot: A Self-Adaptive Attack Framework for Adversarial Manipulation of Multi-Robot Reinforcement Learning

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

Convex Optimization for Alignment and Preference Learning on a Single GPU

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

Score-Based One-step MeanFlow Policy Optimization

Curriculum reinforcement learning with measurable task representation learning

From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

Droneulator: A Portable UAV Simulator for Agricultural Workflows with RotorPy and Godot 4

Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Goal-Conditioned Agents that Learn Everything All at Once

SafeSABR: Risk-Calibrated Adaptive Bitrate Streaming over Starlink Networks

ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

Understanding Goal Generalisation in Sequential Reinforcement Learning

Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

SeedER: Seed-and-Expand Retrieval from Knowledge Graphs

Robotic Strawberry Harvesting with Robust Vision and Deep Reinforcement Learning based Sim-to-Real Control

Geo-Align: Video Generation Alignment via Metric Geometry Reward

Keyword: diffusion policy

SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control

Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation
