Generated: 2026-06-29 20:36:03 (UTC+8); arXiv release time: 2026-06-29 20:00 EDT (2026-06-30 08:00 UTC+8)

28 related papers today

Keyword: reinforcement learning

OverFlowLight: Real-Time Gridlock Prevention and Traffic Signal Optimization for Urban Intersections

Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience

Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning

PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration

Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF

Learning to Throw: Agile and Accurate Cable-Suspended Payload Delivery with a Quadrotor

Qwen-Image-2.0-RL Technical Report

Training Observable Control Policies to Expose Agent State Through Actions

Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation

Learning to Reason with Curriculum II: Compositional Generalization

BashCoder-R1: Towards Robust and Explainable Bash Code Generation with Robustness-Aware Group Relative Policy Optimization

ToE: A Hierarchical and Explainable Claim Verification Framework with Dynamic Multi-source Evidence Retrieval and Aggregation

PerturbCellRL: Verifier-Guided Reinforcement Learning for Single-Cell Perturbation Prediction

RS-Diffuser: Risk-Sensitive Diffusion Planning with Distributional Value Guidance

NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning

Booster Lab: A Data-Centric Pipeline for Learning Deployable Humanoid Locomotion Policies

ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents

PPO-EAL: Exact Augmented Lagrangian Proximal Policy Optimization for Safe Robotic Control

From Bootstrapping to Sequence Modeling: A Unified Generative Framework for Personalized Landing-Page Modeling

Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding

Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing

Two-Stage Fine-Tuning for Protein Sequence Generation with Targeted Amino-Acid Composition

TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL

Regularized Reward-Punishment Reinforcement Learning

Tandem Reinforcement Learning with Verifiable Rewards

Learning Stable In-Grasp Manipulation in a Non-Dropping Action Space

Keyword: diffusion policy

No results found.