Generated: 2026-04-23 17:47:20 (UTC+8); arXiv announcement time: 2026-04-23 20:00 EDT (2026-04-24 08:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

Visual Reasoning through Tool-supervised Reinforcement Learning

Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems

Multi-Objective Reinforcement Learning for Generating Covalent Inhibitor Candidates

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

Maximum Entropy Semi-Supervised Inverse Reinforcement Learning

On the Stability and Generalization of First-order Bilevel Minimax Optimization

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

Toward Safe Autonomous Robotic Endovascular Interventions using World Models

Temporally Extended Mixture-of-Experts Models

Lever: Inference-Time Policy Reuse under Support Constraints

RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings

TL-RL-FusionNet: An Adaptive and Efficient Reinforcement Learning-Driven Transfer Learning Framework for Detecting Evolving Ransomware Threats

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

ETac: A Lightweight and Efficient Tactile Simulation Framework for Learning Dexterous Manipulation

Hybrid Latent Reasoning with Decoupled Policy Optimization

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

Video-ToC: Video Tree-of-Cue Reasoning

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

Where Reasoning Breaks: Logic-Aware Path Selection by Controlling Logical Connectives in LLMs Reasoning Chains

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

Visual-Tactile Peg-in-Hole Assembly Learning from Peg-out-of-Hole Disassembly

Near-Future Policy Optimization

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Keyword: diffusion policy

No results