Generated: 2026-02-11 16:53:50 (UTC+8); arXiv release time: 2026-02-11 20:00 EST (2026-02-12 09:00 UTC+8)

46 related papers today

Keyword: reinforcement learning

UI-Venus-1.5 Technical Report

An Actor-Critic-Identifier Control Design for Increasing Energy Efficiency of Automated Electric Vehicles

Boltzmann Reinforcement Learning for Noise resilience in Analog Ising Machines

$n$-Musketeers: Reinforcement Learning Shapes Collaboration Among Language Models

EExApp: GNN-Based Reinforcement Learning for Radio Unit Energy Optimization in 5G O-RAN

CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning

Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk

Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation

CAPER: Constrained and Procedural Reasoning for Robotic Scientific Experiments

Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning

SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Online Learning in MDPs with Partially Adversarial Transitions and Losses

Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models

Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models

Training deep physical neural networks with local physical information bottleneck

Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

On the Optimal Reasoning Length for RL-Trained Language Models

Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

Directed Information: Estimation, Optimization and Applications in Communications and Causality

ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm

DiffuReason: Bridging Latent Reasoning and Generative Refinement for Sequential Recommendation

Grounding LTL Tasks in Sub-Symbolic RL Environments for Zero-Shot Generalization

Diverse Skill Discovery for Quadruped Robots via Unsupervised Learning

Flexible Entropy Control in RLVR with Gradient-Preserving Perspective

A Controlled Study of Double DQN and Dueling DQN Under Cross-Environment Transfer

Code2World: A GUI World Model via Renderable Code Generation

QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search

ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

SCOPE: A Training-Free Online 3D Deployment for UAV-BSs with Theoretical Analysis and Comparative Study

ORCHID: Fairness-Aware Orchestration in Mission-Critical Air-Ground Integrated Networks

ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference

Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

A Collaborative Safety Shield for Safe and Efficient CAV Lane Changes in Congested On-Ramp Merging

ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

Resilient Topology-Aware Coordination for Dynamic 3D UAV Networks under Node Failure

Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection

Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Anagent For Enhancing Scientific Table & Figure Analysis

CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Keyword: diffusion policy

CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning

Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation
