Generated: 2026-06-30 18:59:55 (UTC+8); arXiv release time: 2026-06-30 20:00 EDT (2026-07-01 08:00 UTC+8)

65 related papers today

Keyword: reinforcement learning

Multi-Agent DRL for QoS and Energy Optimization in RIS-Enabled Open-RAN Industrial 6G TN/NTN Networks

RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning

Reinforcement Learning for Software Vulnerability Analysis: A Systematic Review with Emphasis on C/C++ Source Code and Static Analysis

Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

Dockerless: Environment-Free Program Verifier for Coding Agents

R$^2$-Searcher: Calibrating Retrieval and Reasoning Boundaries for Agentic Search

Neuromorphic Energy-Aware Learning for Adaptive Deep Brain Stimulation

Entropy Regularized Reinforcement Learning for Zero-Sum Stochastic Differential Games in a Regime-Switching Jump-Diffusion Process

Entropy-Regularized Reinforcement Learning for Linear-Quadratic Stackelberg Differential Games in Regime-Switching Diffusion Models

An AI agent for treatment reasoning over a biomedical tool universe

BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards

Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization

Physics Models for Sim-to-Real Transfer in Professional-Level Robot Table Tennis

Q-DASC: State-of-the-Art Safe Quantum Control for HVAC under Local Model Misspecification

A3M: Adaptive, Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions

Modification-Considering Value Learning for Reward Hacking Mitigation in RL

Efficient Spatio-Temporal Grounding with Multimodal Large Models via Second-Level Tracking and RL Verification

Fairness Attacks on Recommender Systems

Masked Diffusion Decoding as $x$-Prediction Flow

HiComm: Hierarchical Communication for Multi-agent Reinforcement Learning

GPC: Large-Scale Generative Pretraining for Transferable Motor Control

OASIF: An Efficient Obfuscation-Aware Self-Improving Framework for LLM-Based Assembly Code Instruction Following and Comprehension

MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling

Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners

LAMP: Long-Horizon Adaptive Manipulation Planning for Multi-Robot Collaboration in Cluttered Space

EntroRouter: Learning Efficient Model Routing via Entropy Regulation

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation

UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation

Reinforcement Learning in Super Mario Bros: Curriculum, Pedagogy, and Optimal Level Design in World 1-1

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

Persona-Trained Monte Carlo: Estimating Market-Outcome Distributions via Swarms of Persona-Conditioned Neural Policy Bots in a Limit Order Book

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF

MR-IQA: A Unified Margin View of Regression and Ranking for Blind Image Quality Assessment

SMART-MIG: A Learning Framework for Scalable and Energy-Efficient GPU Scheduling

Accelerating Q-learning through Efficient Value-Sharing across Actions

Consistency as Inductive Bias: Learning Cross-View Invariance for Robust Multimodal Reasoning

Dual-Flow Reinforcement Learning with State-Aware Exploration

KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search

RoAd-RL: A Unified Library and Benchmark for Robust Adversarial Reinforcement Learning

AI Training Manager: Bounded Closed-Loop Control of Adaptive Training Recipes

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

StrucTab: A Structured Optimization Framework for Table Parsing

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

RoamFlow: Reinforcement-Aligned One-Step Action MeanFlow Policy for Image-Goal Navigation

LatentRevise: Learning from Zero-Hit Reasoning

Exploration and Online Transfer with Behavioral Foundation Models

Be Faithful When Response: Returning Fluent and Grounded Answers for Vision-Language Models Reinforcement Learning

ACPO: Agent-Chained Policy Optimization for Multi-Agent Reinforcement Learning

Hierarchical Reinforcement Learning in StarCraft Micromanagement with Influence Maps and Cluster-based Scripts

Domain Adaptation with Adaptive Imagination for Visual Reinforcement Learning under Limited Target Data

Sparse Sensor Placement in Multi-Agent Reinforcement Learning Control of Rayleigh-Bénard Convection

KYON: Semi-Modular Wheel-Legged Quadruped With Agile Bimanual Capability

Toward an Energy-Optimized Operation of Data Centers Located in Wind Farms Using Reinforcement Learning

DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training

FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

Diffusion Fine-tuning with Rewarded Moment Matching Distillation

Experience Augmented Policy Optimization for LLM Reasoning

Grasp-Oriented Non-Prehensile Manipulation via Learning a Graspability Field

When and Which Sensor to Observe? Timely Tracking of a Joint Markov Source

Keyword: diffusion policy

Keypose Exploration: Efficient Automatic Trajectory Labelling and Cross-Embodiment Policy Transfer

Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering
