Generated: 2026-02-04 16:48:09 (UTC+8); arXiv release time: 2026-02-04 20:00 EST (2026-02-05 09:00 UTC+8)

77 relevant papers today

Keyword: reinforcement learning

GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning

Formulating Reinforcement Learning for Human-Robot Collaboration through Off-Policy Evaluation

Hypersonic Flow Control: Generalized Deep Reinforcement Learning for Hypersonic Intake Unstart Control under Uncertainty

CADENT: Gated Hybrid Distillation for Sample-Efficient Transfer in Reinforcement Learning

Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization

BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation

Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization

BinaryPPO: Efficient Policy Optimization for Binary Classification

Maximum Likelihood Reinforcement Learning

Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion

From Tokens to Numbers: Continuous Number Modeling for SVG Generation

Adaptive Linear Path Model-Based Diffusion

Causal Flow Q-Learning for Robust Offline Reinforcement Learning

Latent Perspective-Taking via a Schrödinger Bridge in Influence-Augmented Local Models

IMAGINE: Intelligent Multi-Agent Godot-based Indoor Networked Exploration

Manifold-Constrained Energy-Based Transition Models for Offline Reinforcement Learning

Spatiotemporal Decision Transformer for Traffic Coordination

Notes on the Reward Representation of Posterior Updates

How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

Human-Centric Traffic Signal Control for Equity: A Multi-Agent Action Branching Deep Reinforcement Learning Approach

Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control

Co2PO: Coordinated Constrained Policy Optimization for Multi-Agent RL

Learning Fast Monomial Orders for Gröbner Basis Computations

Structuring Value Representations via Geometric Coherence in Markov Decision Processes

CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT

ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning

Training and Simulation of Quadrupedal Robot in Adaptive Stair Climbing for Indoor Firefighting: An End-to-End Reinforcement Learning Approach

Test-time Recursive Thinking: Self-Improvement without External Feedback

One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost

Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

Self-Hinting Language Models Enhance Reinforcement Learning

Intelligent Front-End Personalization: AI-Driven UI Adaptation

StepScorer: Accelerating Reinforcement Learning with Step-wise Scoring and Psychological Regret Modeling

Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

Reinforcement Learning with Promising Tokens for Large Language Models

From Scalar Rewards to Potential Trends: Shaping Potential Landscapes for Model-Based Reinforcement Learning

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

Periodic Regularized Q-Learning

medR: Reward Engineering for Clinical Offline Reinforcement Learning via Tri-Drive Potential Functions

Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models

MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning

MentalSeek-Dx: Towards Progressive Hypothetico-Deductive Reasoning for Real-world Psychiatric Diagnosis

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

An Approximate Ascent Approach To Prove Convergence of PPO

Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

Enhancing Navigation Efficiency of Quadruped Robots via Leveraging Personal Transportation Platforms

Learning-based Initialization of Trajectory Optimization for Path-following Problems of Redundant Manipulators

CRL-VLA: Continual Vision-Language-Action Learning

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

Reparameterization Flow Policy Optimization

Learning to Reason Faithfully through Step-Level Faithfulness Maximization

CMR: Contractive Mapping Embeddings for Robust Humanoid Locomotion on Unstructured Terrains

Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning

AffordanceGrasp-R1: Leveraging Reasoning-Based Affordance Segmentation with Reinforcement Learning for Robotic Grasping

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

TRE: Encouraging Exploration in the Trust Region

Reinforcement Fine-Tuning for History-Aware Dense Retriever in RAG

Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation

Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

Efficient Estimation of Kernel Surrogate Models for Task Attribution

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

SymPlex: A Structure-Aware Transformer for Symbolic PDE Solving

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Keyword: diffusion policy

How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?
