Generated: 2026-01-15 16:33:52 (UTC+8); arXiv release time: 2026-01-15 20:00 EST (2026-01-16 09:00 UTC+8)

There are 20 relevant papers today

Keyword: reinforcement learning

Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR

TranslateGemma Technical Report

SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache

SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL

UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning

GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization

Reward Learning through Ranking Mean Squared Error

Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for Large Language Models

Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability

RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering

Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction

Policy-Based Reinforcement Learning with Action Masking for Dynamic Job Shop Scheduling under Uncertainty: Handling Random Arrivals and Machine Failures

Monte-Carlo Tree Search with Neural Network Guidance for Lane-Free Autonomous Driving

GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR

Semi-Contention-Free Access in IoT NOMA Networks: A Reinforcement Learning Framework

Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps

Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering

DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

STEP3-VL-10B Technical Report

Keyword: diffusion policy

No results