Generated: 2026-06-18 19:52:58 (UTC+8); arXiv announcement time: 2026-06-18 20:00 EDT (2026-06-19 08:00 UTC+8)

30 relevant papers today

Keyword: reinforcement learning

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

打破求解器瓶颈:在可学习前沿训练任务生成器

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL:基于强化学习的自一致性训练

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

恢复、发现、计划:从机器人故障中学习技能和概念

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集:面向视频多模态大语言模型(Video-MLLMs)视觉聚焦的共识帧对齐

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

结构化表示学习,采用局部线性嵌入和自适应特征融合

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

N(CO)$^2$:带偶然约束的神经组合优化以解决随机定向

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

稀疏性诅咒:通过模型合并理解RLVR模型参数空间

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

基于视觉的机器人操作强化学习中的行动空间基准测试

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST:语用语言理解的自我强化反事实推理

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

SRL:结合SLIP模型与强化学习实现敏捷机器人跳跃

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

利用大型语言模型(LLM),通过基于人类反馈的迭代强化学习生成自然且富有表现力的机器人手势

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

R2D-RL:一个用于多智能体强化学习的RoboCup 2D足球环境

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

从自身解答中学习:面向可验证奖励强化学习的自条件信用分配

Reinforcement Learning Foundation Models Should Already Be A Thing

强化学习基础模型本应已经存在

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

成熟的马尔可夫决策过程:信息量增加和行动集缩小下的决策

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

超越奖励工程:长上下文强化学习的数据配方

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

部分可观测环境中导航的生成模型预测规划

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES:面向测试时扩展的修订与验证增强训练

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

以物体为中心的残差强化学习,用于零样本仿真到现实的VLA增强

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO:基于图的推理模型策略优化

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout:面向强化学习采样(rollout)的系统感知自推测解码

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Spotlight:协同种子探索与竞价(Spot)GPU,用于DiT强化学习后训练

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

ProductConsistency:通过SFT和RL改进基于指令的图像编辑中的产品身份保护

Pareto Q-Learning with Reward Machines

用奖励机进行帕累托Q学习

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

奖励一直都在你的数据中:用判别器引导的强化学习纠正流匹配

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

预测重要因素:决策导向的强化学习,用于受控电动汽车充电,且出发时间未知

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE:惊异度(surprisal)引导的词元级优势重加权,以实现策略熵稳定

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

重新思考奖励监督:以评分标准为条件的自蒸馏

Learning User Simulators with Turing Rewards

利用图灵奖励学习用户模拟器

Native Active Perception as Reasoning for Omni-Modal Understanding

原生主动感知即推理:面向全模态理解

Keyword: diffusion policy

There is no result