Generated: 2026-02-06 16:46:32 (UTC+8); arXiv release time: 2026-02-06 20:00 EST (2026-02-07 09:00 UTC+8)

52 related papers today

Keyword: reinforcement learning

Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog

Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics

Privileged Information Distillation for Language Models

Laplacian Representations for Decision-Time Planning

ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation

Optimizing Mission Planning for Multi-Debris Rendezvous Using Reinforcement Learning with Refueling and Adaptive Collision Avoidance

Reinforcement Learning Enhancement Using Vector Semantic Representation and Symbolic Reasoning for Human-Centered Autonomous Emergency Braking

Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning

EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

MobileManiBench: Simplifying Model Verification for Mobile Manipulation

RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

Formal Synthesis of Certifiably Robust Neural Lyapunov-Barrier Certificates

GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL

Imagine a City: CityGenAgent for Procedural 3D City Generation

Rich-Media Re-Ranker: A User Satisfaction-Driven LLM Re-ranking Framework for Rich-Media Search

DistillER: Knowledge Distillation in Entity Resolution with Large Language Models

When Are RL Hyperparameters Benign? A Study in Offline Goal-Conditioned RL

ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

A Unified Framework for Rethinking Policy Divergence Measures in GRPO

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

TOLEBI: Learning Fault-Tolerant Bipedal Locomotion via Online Status Estimation and Fallibility Rewards

HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments

Mode-Dependent Rectification for Stable PPO Training

Rewards as Labels: Revisiting RLVR from a Classification Perspective

UAV Trajectory Optimization via Improved Noisy Deep Q-Network

Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

Mitigating Hallucination in Financial Retrieval-Augmented Generation via Fine-Grained Knowledge Verification

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards

RL-VLA$^3$: Reinforcement Learning VLA Accelerating via Full Asynchronism

Cross-Domain Offline Policy Adaptation via Selective Transition Correction

Distributional Reinforcement Learning with Diffusion Bridge Critics

TKG-Thinker: Towards Dynamic Reasoning over Temporal Knowledge Graphs via Agentic Reinforcement Learning

Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning

UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents

Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

Residual Reinforcement Learning for Waste-Container Lifting Using Large-Scale Cranes with Underactuated Tools

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

Quantum Reinforcement Learning with Transformers for the Capacitated Vehicle Routing Problem

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

$f$-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Learning to Share: Selective Memory for Efficient Parallel Agentic Systems

VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

On Computation and Reinforcement Learning

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Can vision language models learn intuitive physics from interaction?

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Keyword: diffusion policy

No results.