Generated: 2026-02-20 16:44:04 (UTC+8); arXiv release time: 2026-02-20 20:00 EST (2026-02-21 09:00 UTC+8)

24 relevant papers today

Keyword: reinforcement learning

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

References Improve LLM Alignment in Non-Verifiable Domains

VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

Training Large Reasoning Models Efficiently via Progressive Thought Encoding

SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Discovering Multiagent Learning Algorithms with Large Language Models

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

A Unified Framework for Locality in Scalable MARL

A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents

Action-Graph Policies: Learning Action Co-dependencies in Multi-Agent Reinforcement Learning

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Spatio-temporal dual-stage hypergraph MARL for human-centric multimodal corridor traffic signal control

Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form

AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation

Continual uncertainty learning

RLGT: A reinforcement learning framework for extremal graph theory

LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

Computer-Using World Model

MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Keyword: diffusion policy

No results