Generated: 2026-06-16 20:58:41 (UTC+8); arXiv release time: 2026-06-16 20:00 EDT (2026-06-17 08:00 UTC+8)

78 relevant papers today

Keyword: reinforcement learning

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Temporal Difference Learning for Diffusion Models

Towards Ubiquitous 6G Computing and Networking Convergence: Architecture and Mechanism for Cross-Domain Resource Coordination

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

SPARK: Spatial Policy-driven Adaptive Reinforcement learning for Knowledge distillation

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

Discovering Lattice Reduction Strategies via Self-Play

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Hamilton-Jacobi Reachability-Based Safe Reinforcement Learning for Emergency Collision Avoidance

CausalDrive: Real-time Causal World Models for Autonomous Driving

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

Understanding Diversity Collapse in RLVR via the Lens of Overtraining

Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems

Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

Proximal Policy Optimization for Amortized Discrete Sampling

FlashNav: Ultra-Fast Policy Training for Robot Navigation within 20 Seconds

STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Reinforcement Learning for LLM-based Event Forecasting

Energy-Efficient Arm Reaching for a Humanoid Robot via Deep Reinforcement Learning with Identified Power Models

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Artificial Intelligence for Power-Converter-Rich Electrical Systems: A Review

Thinking with Visual Grounding

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

Binary Decompilation LLM with Feedback-Driven Multi-Turn Refinement

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

Graphical conditional generative modeling for digital twin modeling

Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation

An Adjoint-based Neural Regulator for Real-Time Optimal Control with State Constraints

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Diffusion Offline Reinforcement Learning for Fair and Energy-Efficient UAV-Assisted Wireless Networks

PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

BRICKS-WM: Building Reusability via Interface Composition Kinetics for Structured World Models

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

How Post-Training Shapes Biological Reasoning Models

Incentives and Evidence in Learned Service Orchestration

ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning

Steering Generative Reinforcement Learning into Stable Robotic Controller

Infant Spontaneous Movement Noise Improves Exploration in Deep RL

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

Reinforcement Learning with Inner-loop Dynamics Estimator for Aerial Manipulation under Uncertainty

Understanding Automated Web GUI Testing: An Empirical Study Across Exploration Strategies and State Abstractions

VENOM: Versatile Embodied Network for Omni-bodied Motion tracking

Harmonizing Semantic and Collaborative in LLMs: Reasoning-based Embedding Generator for Sequential Recommendation

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

Pride and Prejudice: Toward an Information-Theoretic Framework for Mutually Communicative Driver Behavior Modeling

Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

Understanding the Behaviors of Environment-aware Information Retrieval

Deep Q-Learning on Hölder Spaces

Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning

Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation

Greed Is Learned: Visible Incentives as Reward-Hacking Triggers

A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

DreamX-World 1.0: A General-Purpose Interactive World Model

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

ExpRL: Exploratory RL for LLM Mid-Training

DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

Context-Aware RL for Agentic and Multimodal LLMs

The Value Axis: Language Models Encode Whether They're on the Right Track

Keyword: diffusion policy

No results found