Generated: 2026-04-29 18:08:11 (UTC+8); arXiv release time: 2026-04-29 20:00 EDT (2026-04-30 08:00 UTC+8)

36 related papers today

Keyword: reinforcement learning

Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Compute Aligned Training: Optimizing for Test Time Inference

Sparse Personalized Text Generation with Multi-Trajectory Reasoning

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

Zero Shot Coordination for Sparse Reward Tasks with Diverse Reward Shapings

Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

Prior-Aligned Data Cleaning for Tabular Foundation Models

CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

How Can Reinforcement Learning Achieve Expert-level Placement?

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space

Multi-action Tangled Program Graphs for Multi-task Reinforcement Learning with Continuous Control

Safe-Support Q-Learning: Learning without Unsafe Exploration

Benchmarking and Improving GUI Agents in High-Dynamic Environments

Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

A Systematic Post-Train Framework for Video Generation

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

Improving Zero-Shot Offline RL via Behavioral Task Sampling

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty

Sample-efficient Neuro-symbolic Proximal Policy Optimization

Egocentric Tactile and Proximity Sensors as Observation Priors for Humanoid Collision Avoidance

Modeling Human-Like Color Naming Behavior in Context

K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance

Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

QAROO: AI-Driven Online Task Offloading for Energy-Efficient and Sustainable MEC Networks

EOS-Bench: A Comprehensive Benchmark for Earth Observation Satellite Scheduling

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Three Models of RLHF Annotation: Extension, Evidence, and Authority

TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Keyword: diffusion policy

No results.