Generated: 2026-03-12 16:48:34 (UTC+8); arXiv announcement time: 2026-03-12 20:00 EDT (2026-03-13 08:00 UTC+8)

39 related papers today

Keyword: reinforcement learning

Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

Cluster-Aware Attention-Based Deep Reinforcement Learning for Pickup and Delivery Problems

Improving Search Agent with One Line of Code

Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning

From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification

SteadyTray: Learning Object Balancing Tasks in Humanoid Tray Transport via Residual Reinforcement Learning

ScanDP: Generalizable 3D Scanning with Diffusion Policy

Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints

Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery

Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning

Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents

Safety-critical Control Under Partial Observability: Reach-Avoid POMDP meets Belief Space Control

Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

AdaClearGrasp: Learning Adaptive Clearing for Zero-Shot Robust Dexterous Grasping in Densely Cluttered Environments

Reinforcement Learning with Conditional Expectation Reward

MAVEN: A Meta-Reinforcement Learning Framework for Varying-Dynamics Expertise in Agile Quadrotor Maneuvers

ASTER: Attitude-aware Suspended-payload Quadrotor Traversal via Efficient Reinforcement Learning

mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

RL-Augmented MPC for Non-Gaited Legged and Hybrid Locomotion

Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Ergodicity in reinforcement learning

Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control

Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

Learning Adaptive Force Control for Contact-Rich Sample Scraping with Heterogeneous Materials

Keyword: diffusion policy

Update-Free On-Policy Steering via Verifiers

ScanDP: Generalizable 3D Scanning with Diffusion Policy

PPGuide: Steering Diffusion Policies with Performance Predictive Guidance
