Generated: 2026-06-05 19:31:55 (UTC+8); arXiv release time: 2026-06-05 20:00 EDT (2026-06-06 08:00 UTC+8)

47 related papers today

Keyword: reinforcement learning

Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning

Inverse Manipulation through Symbolic Planning and Residual Operator Learning

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation

When AI Says It Feels

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction

TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

Exploring cooperation mechanisms via reinforcement learning in network common-pool resource games

LadderMan: Learning Humanoid Perceptive Ladder Climbing

TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion

When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies

L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation

Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

On Advantage Estimates for Max@K Policy Gradients

Adaptive state-action abstractions via rate-distortion

MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation

Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation

EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

Maximising the Set-Piece Return: Optimising Football Corner Tactics with Graph Reinforcement Learning

Emergent Language as an Approach to Conscious AI

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Keyword: diffusion policy

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation
