Generated: 2026-04-14 17:24:56 (UTC+8); arXiv release time: 2026-04-14 20:00 EDT (2026-04-15 08:00 UTC+8)

78 related articles today

Keyword: reinforcement learning

Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale

A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

Belief-Aware VLM Model for Human-like Reasoning

Cayley Graph Optimization for Scalable Multi-Agent Communication Topologies

Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank

ExecTune: Effective Steering of Black-Box LLMs with Guide Models

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

GIANTS: Generative Insight Anticipation from Scientific Literature

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

Deep Reinforcement Learning for Cognitive Time-Division Joint SAR and Secure Communications

Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

MAVEN-T: Multi-Agent enVironment-aware Enhanced Neural Trajectory predictor with Reinforcement Learning

Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

A Dual-Positive Monotone Parameterization for Multi-Segment Bids and a Validity Assessment Framework for Reinforcement Learning Agent-based Simulation of Electricity Markets

A Queueing-Theoretic Framework for Dynamic Attack Surfaces: Data-Integrated Risk Analysis and Adaptive Defense

SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents

Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation

Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation

Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

AWARE: Adaptive Whole-body Active Rotating Control for Enhanced LiDAR-Inertial Odometry under Human-in-the-Loop Interaction

On the Optimization Landscape of Observer-based Dynamic Linear Quadratic Control

Preference-Agile Multi-Objective Optimization for Real-time Vehicle Dispatching

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

Adaptive Bounded-Rationality Modeling of Early-Stage Takeover in Shared-Control Driving

PokeRL: Reinforcement Learning for Pokemon Red

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation

CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation

Diffusion Reinforcement Learning Based Online 3D Bin Packing Spatial Strategy Optimization

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

Robust Adversarial Policy Optimization Under Dynamics Uncertainty

When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

NimbusGuard: A Novel Framework for Proactive Kubernetes Autoscaling Using Deep Q-Networks

Optimal Stability of KL Divergence under Gaussian Perturbations

RTMC: Step-Level Credit Assignment via Rollout Trees

Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning

ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation

HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Triviality Corrected Endogenous Reward

RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

Discourse Diversity in Multi-Turn Empathic Dialogue

Autonomous Diffractometry Enabled by Visual Reinforcement Learning

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Keyword: diffusion policy

OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
