Generated: 2026-05-28 19:41:56 (UTC+8); arXiv release time: 2026-05-28 20:00 EDT (2026-05-29 08:00 UTC+8)

64 relevant papers today

Keyword: reinforcement learning

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

Differentiable Model Predictive Safety for Heterogeneous Mobility at Urban Intersections

SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation

Explicit Critic Guidance for Aligning Diffusion Models

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Playing with Words, Improving with Rewards: Training Language Models for Creative Association

Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

S-Cheetah: A Novel Quadrupedal Robot with a 3-DOF Active Spine Learning Agile Locomotion

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

Cyclical Entropy Eruption: Entropy Dynamics in Agent Reinforcement Learning

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

ABot-OCR Technical Report

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

Commit to the Bit: Reactive Reinforcement Learning Done Right

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Plan Before Search: Search Agents Need Plan

Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Learning a Kinodynamic Trajectory Manifold for Impact-Aware Compliant Catching of Fast-Moving Objects

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection

Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

AlphaTransit: Learning to Design City-scale Transit Routes

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

Keyword: diffusion policy

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following
