Generated: 2026-06-03 20:44:04 (UTC+8); arXiv release time: 2026-06-03 20:00 EDT (2026-06-04 08:00 UTC+8)

46 relevant papers today

Keyword: reinforcement learning

Margin Play: A Multi-Agent System For Public Policy Analysis In The Brazilian Equatorial Margin

Inference Cost Attacks for Retrieval-Augmented Large Language Models

Motion Planning in Dynamic Environments: A Survey from Classical to Modern Methods

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Fairness Definitions and Metrics in Deep Reinforcement Learning for Drug Discovery in Healthcare: A Rapid Evidence Review

ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL

Hint-Guided Diversified Policy Optimization for LLM Reasoning

Brief Announcement: Generative Markov Model for Distributed Computing Systems

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Efficient Hyperparameter Optimization for LLM Reinforcement Learning

Libra: Efficient Resource Management for Agentic RL Post-Training

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

Cost-Aware Optimization for Agentic Query Execution

ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control

MemTrain: Self-Supervised Context Memory Training

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

EaDex: A Cross-Embodiment Dexterous Manipulation Framework from Low-Cost Demonstrations

GPU-Parallel Multi-Task Reinforcement Learning with Demonstration Guided Policy Optimization

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

PerchRL: Vision-Based Agile Perching on Inclined Platforms under Rapid and Irregular Motion

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Post-Hoc Robustness for Model-Based Reinforcement Learning

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

When are supercapacitors practically feasible in electric vehicles?

Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning

Trading Human Curation for Synthetic Augmentation in RLVR

Easy-to-Use Shielding for Reinforcement Learning

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Keyword: diffusion policy

No results found.