Generated: 2026-02-12 16:53:04 (UTC+8); arXiv release time: 2026-02-12 20:00 EST (2026-02-13 09:00 UTC+8)

47 relevant papers today

Keyword: reinforcement learning

Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement

Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

Learning to Evict from Key-Value Cache

The Role of Learning in Attacking Intrusion Detection Systems

Confounding Robust Continuous Control via Automatic Reward Shaping

Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality

Efficient Policy Adaptation for Voltage Control Under Unknown Topology Changes

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

Breaking the Curse of Repulsion: Optimistic Distributionally Robust Policy Optimization for Off-Policy Generative Recommendation

Control Reinforcement Learning: Token-Level Mechanistic Analysis via Learned SAE Feature Steering

AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning

Found-RL: foundation model-enhanced reinforcement learning for autonomous driving

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

What Makes Value Learning Efficient in Residual Reinforcement Learning?

ReSPEC: A Framework for Online Multispectral Sensor Reconfiguration in Dynamic Environments

SplitCom: Communication-efficient Split Federated Fine-tuning of LLMs via Temporal Compression

MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

LLM-Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization

Neuro-symbolic Action Masking for Deep Reinforcement Learning

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation

Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

Dynamic Interference Management for TN-NTN Coexistence in the Upper Mid-Band

Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios

ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents

Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation

Resource-Efficient Model-Free Reinforcement Learning for Board Games

Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins

Multi-Task Reinforcement Learning of Drone Aerobatics by Exploiting Geometric Symmetries

Fine-Tuning GPT-5 for GPU Kernel Generation

Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models

Simultaneous Speech-to-Speech Translation Without Aligned Data

Chatting with Images for Introspective Visual Thinking

RISE: Self-Improving Robot Policy with Compositional World Model

Interpretable Attention-Based Multi-Agent PPO for Latency Spike Resolution in 6G RAN Slicing

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards

Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows

APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots

Keyword: diffusion policy

No results.