Generated: 2026-06-09 19:21:11 (UTC+8); arXiv release time: 2026-06-09 20:00 EDT (2026-06-10 08:00 UTC+8)

81 related papers today

Keyword: reinforcement learning

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks

UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

Belief-Space Quantum-Inspired Reinforcement Learning for Partially Observable Autonomous Cyber Defense in the Internet of Vehicles

Quantum-Inspired Reinforcement Learning for Low-Latency Intrusion Detection in V2X and Internet-of-Vehicles Networks

X-OP: Cross-Morphology Whole-Body Teleoperation via MPC Retargeting

Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

Continual Quadruped Robots Coordination via Semantic Skill Discovery

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

CATPO: Critique-Augmented Tree Policy Optimization

Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control

Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

GIFT: LLM-Guided State-Reward Interface for Financial Reinforcement Learning

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Reinforcement Learning for Flow-Matching Policies with Density Transport

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

SPA: A SQL-Plan-Aware Reinforcement Learning Framework for Query Rewriting with LLMs

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape

Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

Guided Discovery of New Behaviors using Diffusion Policies

Co-Evolving Skill Generation and Policy Optimization

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

Multilingual Sentiment Aware Text Summarization: A Reinforcement Learning Approach for Consistency Maintenance

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

Stage-1 Controls the Entropy Regime, Not the Outcome

A Unifying Lens on Reward Uncertainty in RLHF

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

Counterfactual Transport Flows for Offline Conservative Trajectory Refinement

AutoPilot: Learning to Steer High Speed Robust BFT

A Regret Minimization Framework on Preference Learning in Large Language Models

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

Temporal-Aware Reasoning Optimization for Video Temporal Grounding

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

PriFT: Prior-Support Guided Supervised Fine-Tuning

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

Emergence of Context Characteristics Sensitivity in Large Language Models

Safe-RULE: Safe Reinforcement UnLEarning

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

Rethinking the Divergence Regularization in LLM RL

An Agency-Transferring Model-Free Policy Enhancement Technique

Keyword: diffusion policy

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

Guided Discovery of New Behaviors using Diffusion Policies

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks
