Generated: 2026-07-03 18:38:13 (UTC+8); arXiv release time: 2026-07-03 20:00 EDT (2026-07-04 08:00 UTC+8)

34 relevant papers today

Keyword: reinforcement learning

WaveLander: A Generalizable Hierarchical Control Framework for UAV Landing on Wave-Disturbed Platforms via Reinforcement Learning

Simulation Based Reward Function Validation for Multi-Agent On Orbit Inspection

The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

Wind-Aware Reinforcement Learning Control of a Small Quadrotor Using Learned Onboard Wind Estimation in Simulated Atmospheric Turbulence

Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model

Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling

One Demonstration Is Enough for Real-World Robotic Reinforcement Learning

CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning

DRL-CLBA: A Clean Label Backdoor Attack for Speech Classification via DDPG Reinforcement Learning

Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

Lightweight Safe Reinforcement Learning for End-to-End UAV Navigation

Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling

Decomposer: Learning to Decompile Symbolic Music to Programs

Learning the Supports for Categorical Critic in Reinforcement Learning

Rank-Then-Act: Reward-Free Control from Frame-Order Progress

SPLC: Social Preference Learning for Crowd Robot Navigation

TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

Cross-Platform Control for Autonomous Surface Vehicles via Adaptive Reinforcement Learning

Evidence-State Rewards for Long-Context Reasoning

Enhancing Fitness Intelligence through Domain-Specific LLM Post-Training

ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning

Actuator Reality Shaping for Zero-Shot Sim-to-Real Robot Learning

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

Generalization in offline RL: The structure is more important than the amount of pessimism

Optimizing Visual Generative Models via Distribution-wise Rewards

DecompRL: Solving Harder Problems by Learning Modular Code Generation

WorldSample: Closed-loop Real-robot RL with World Modelling

Learning Agile Intruder Interception using Differentiable Quadrotor Dynamics

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Seek to Segment: Active Perception for Panoramic Referring Segmentation

Keyword: diffusion policy

No results.