Generated: 2026-02-10 17:02:14 (UTC+8); arXiv release time: 2026-02-10 20:00 EST (2026-02-11 09:00 UTC+8)

94 related papers today

Keyword: reinforcement learning

The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Zero-Shot UAV Navigation in Forests via Relightable 3D Gaussian Splatting

Risk-Sensitive Exponential Actor Critic

Cerebellar-Inspired Residual Control for Fault Recovery: From Inference-Time Adaptation to Structural Consolidation

Evolving LLM-Derived Control Policies for Residential EV Charging and Vehicle-to-Grid Energy Optimization

Optimizing Chlorination in Water Distribution Systems via Surrogate-assisted Neuroevolution

Adaptive Scaffolding for Cognitive Engagement in an Intelligent Tutoring System

High Fidelity Textual User Representation over Heterogeneous Sources via Reinforcement Learning

Meta-Reinforcement Learning for Robust and Non-greedy Control Barrier Functions in Spacecraft Proximity Operations

Scalable Dexterous Robot Learning with AR-based Remote Human-Robot Interactions

Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning

CoMI-IRL: Contrastive Multi-Intention Inverse Reinforcement Learning

Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge

Learning to Self-Verify Makes Language Models Better Reasoners

TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation

Efficient Planning in Reinforcement Learning via Model Introspection

Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

The Laplacian Keyboard: Beyond the Linear Span

Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization

Generative Reasoning Re-ranker

CoLF: Learning Consistent Leader-Follower Policies for Vision-Language-Guided Multi-Robot Cooperative Transport

Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

RLinf-USER: A Unified and Extensible System for Real-World Online Policy Learning in Embodied AI

TodoEvolve: Learning to Architect Agent Planning Systems

MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation

Direct Soft-Policy Sampling via Langevin Dynamics

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Intrinsic Adaptation

Efficient Anti-exploration via VQVAE and Fuzzy Clustering in Offline Reinforcement Learning

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Feasibility-Guided Planning over Multi-Specialized Locomotion Policies

Trajectory-Aware Multi-RIS Activation and Configuration: A Riemannian Diffusion Method

DHEA-MECD: An Embodied Intelligence-Powered DRL Algorithm for AUV Tracking in Underwater Environments with High-Dimensional Features

D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning

When Is Compositional Reasoning Learnable from Verifiable Rewards?

Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization

Horizon Imagination: Efficient On-Policy Training in Diffusion World Models

FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff

Graph-Enhanced Deep Reinforcement Learning for Multi-Objective Unrelated Parallel Machine Scheduling

Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning

Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities

Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

CADO: From Imitation to Cost Minimization for Heatmap-based Solvers in Combinatorial Optimization

DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Document Reconstruction Unlocks Scalable Long-Context RLVR

Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers

When Do Multi-Agent Systems Outperform? Analysing the Learning Efficiency of Agentic Systems

New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

OPE: Overcoming Information Saturation in Parallel Thinking via Outline-Guided Path Exploration

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Learning Human-Like Badminton Skills for Humanoid Robots

Reinforcement Learning with Backtracking Feedback

Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

Intelligent support for Human Oversight: Integrating Reinforcement Learning with Gaze Simulation to Personalize Highlighting

Beyond Correctness: Learning Robust Reasoning via Transfer

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO

Constrained Sampling to Guide Universal Manipulation RL

SemiNFT: Learning to Transfer Presets from Imitation to Appreciation via Hybrid-Sample Reinforcement Learning

Conditional Sequence Modeling for Safe Reinforcement Learning

Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation

Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete and Hybrid Action Spaces

High-Speed Vision-Based Flight in Clutter with Safety-Shielded Reinforcement Learning

From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism

LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning

SoK: The Pitfalls of Deep Reinforcement Learning for Cybersecurity

Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning

Bayesian Preference Learning for Test-Time Steerable Reward Models

VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning

Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning

Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

Efficient and Stable Reinforcement Learning for Diffusion Language Models

StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

Contraction Metric Based Safe Reinforcement Learning Force Control for a Hydraulic Actuator with Real-World Training

iGRPO: Self-Feedback-Driven LLM Reasoning

WorldCompass: Reinforcement Learning for Long-Horizon World Models

TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Keyword: diffusion policy

Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation

STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction
