Unsupervised Anomaly Detection for Multivariate Incomplete Data using GAN-based Data Imputation: A Comparative Study
Sarda K.;Yerudkar A.;Vecchio C. D.
2023-01-01
Abstract
With the increasing interconnectivity of cyber-physical systems (CPSs) in fields such as manufacturing plants, power plants, and smart networked systems, large amounts of multivariate data are generated by sensors and actuators, as well as by other data sources such as measurements and images. This paper focuses on the anomaly detection (AD) problem, also known as fault detection or outlier detection depending on the type of dataset, which involves identifying anomalous values in a dataset using analytical methods. However, datasets often contain missing values, which can lead to incorrect outcomes and reduce the availability of anomalous samples, which are already few in number. Therefore, a generalized AD method is proposed for incomplete datasets. It involves two steps: data imputation (DI) using a generative adversarial network (GAN) to obtain complete datasets, followed by AD on the complete datasets. While statistical imputation methods are commonly used, they do not consider the data distribution of datasets with anomalous samples. The capabilities of GAN-based DI are tested under different hyperparameter settings and percentages of missing values. The AD problem is then addressed using seven unsupervised anomaly detection methods on six different datasets, including a real dataset from a steel manufacturing plant in Italy. Each dataset is analyzed to determine which combination of DI and AD methods performs best. The results show that GAN-based imputation provides the best DI performance, while the reweighted minimum covariance determinant (RMCD) method, combined with GAN, offers the best overall AD results.
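The two-step structure described in the abstract (impute first, then detect on the completed data) can be illustrated with a minimal sketch. It is not the paper's implementation: column-median imputation stands in for the GAN-based DI, and a per-feature median/MAD robust score stands in for RMCD. The function names, the toy data, and the 3.5 threshold are all assumptions made for illustration.

```python
from statistics import median


def impute_median(rows):
    """Step 1 (DI): fill each missing entry (None) with its column median.

    Stand-in for the GAN-based imputer; it only shows where imputation sits.
    """
    n_cols = len(rows[0])
    fills = [median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[fills[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def robust_scores(rows):
    """Step 2 (AD): score each row by its largest per-feature robust z-score.

    Uses median and MAD per feature; a simple stand-in for RMCD, not RMCD itself.
    """
    scores = [0.0] * len(rows)
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        med = median(col)
        mad = median(abs(x - med) for x in col) or 1e-12  # guard against MAD == 0
        for i, x in enumerate(col):
            scores[i] = max(scores[i], 0.6745 * abs(x - med) / mad)
    return scores


def detect(rows, threshold=3.5):
    """Run imputation, then flag rows whose robust score exceeds the threshold."""
    complete = impute_median(rows)
    return [i for i, s in enumerate(robust_scores(complete)) if s > threshold]


# Toy incomplete dataset (hypothetical values); row 6 is the injected anomaly.
rows = [
    [1.0, 10.0],
    [1.2, None],
    [0.9, 10.5],
    [None, 9.8],
    [1.1, 10.2],
    [1.0, 9.9],
    [8.0, 10.1],
    [1.05, 10.3],
]

anomalies = detect(rows)
print(anomalies)  # [6]
```

Either stage can be replaced without touching the other, which is what makes it possible to compare DI and AD method combinations per dataset.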