Generated: 2025-11-12 16:31:12 (UTC+8); arXiv release time: 2025-11-12 20:00 EST (2025-11-13 09:00 UTC+8)

39 relevant papers today

Keyword: reinforcement learning

Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms

RELEAP: Reinforcement-Enhanced Label-Efficient Active Phenotyping for Electronic Health Records

The Polite Liar: Epistemic Pathology in Language Models

Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning

Think Before You Retrieve: Learning Test-Time Adaptive Search with Small Language Models

Partial Action Replacement: Tackling Distribution Shift in Offline MARL

Time-Aware Policy Learning for Adaptive and Punctual Robot Control

ZeroSim: Zero-Shot Analog Circuit Evaluation with Unified Transformer Embeddings

Diffusion Guided Adversarial State Perturbations in Reinforcement Learning

Intelligent Optimization of Multi-Parameter Micromixers Using a Scientific Machine Learning Framework

A Negotiation-Based Multi-Agent Reinforcement Learning Approach for Dynamic Scheduling of Reconfigurable Manufacturing Systems

From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

High-Altitude Balloon Station-Keeping with First Order Model Predictive Control

A Historical Interaction-Enhanced Shapley Policy Gradient Algorithm for Multi-Agent Credit Assignment

From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory

MURPHY: Multi-Turn GRPO for Self Correcting Code Generation

Comparative Study of Q-Learning for State-Feedback LQG Control with an Unknown Model

Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction

Test-driven Reinforcement Learning

Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison

SERL: Self-Examining Reinforcement Learning on Open-Domain

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction

Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning

Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks

A Small Leak Sinks All: Exploring the Transferable Vulnerability of Source Code Models

BIPPO: Budget-Aware Independent PPO for Energy-Efficient Federated Learning Services

An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

UI2Code$^\text{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

Beyond Distributions: Geometric Action Control for Continuous Reinforcement Learning

PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore

Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration

ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games

Understanding Electro-communication and Electro-sensing in Weakly Electric Fish using Multi-Agent Deep Reinforcement Learning

RESTL: Reinforcement Learning Guided by Multi-Aspect Rewards for Signal Temporal Logic Transformation

The Path Not Taken: RLVR Provably Learns Off the Principals

DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs

Keyword: diffusion policy

No results.