Application Of K-Means And Hierarchical Clustering For COVID-19 Data Analysis

Analysis of COVID-19 Global Spread Data

K-means clustering partitions an unlabelled dataset into k clusters so that each observation belongs to exactly one group of observations with similar properties. Hierarchical clustering also groups unlabelled data, but instead of fixing the number of groups in advance it builds a tree of nested clusters. Both clustering methods are used in Task 1, implemented in Python. Task 2 applies market basket analysis with the Apriori algorithm.

Task 1 applies unsupervised learning techniques, namely k-means clustering and hierarchical clustering, to group countries by their COVID statistics. The dataset has 14 variables: the first is the country name, and the remaining 13 are COVID-related attributes, namely population, total cases, new cases, total deaths, new deaths, total recovered, newly recovered, active cases, critical cases, total cases per 1 million population, deaths per 1 million population, total number of COVID tests performed, and total tests per 1 million population. After the dataset is loaded, several pre-processing steps are applied to improve data quality. The features are then reduced by Principal Component Analysis (Garikapati et al. 2021), and unsupervised learning methods are applied to cluster the dataset into groups. The first five rows of the loaded dataset are shown below.

   Country  Population  Total Cases  New Cases  Total Deaths  New Deaths
0  World    NaN         406944092    638463.0   5810434       2654.0
1  USA      334125343   79052681     NaN        939,427       NaN
2  India    1401828285  42536137     NaN        507,208       NaN
3  Brazil   214990157   27125512     NaN        636,111       NaN
4  France   65505907    21372278     NaN        134,207       NaN

   Total Recovered  New Revovered  Active Cases  Serious, Critical Cases
0  327036011.0      915688.0       74097647.0    89283.0
1  49435538.0       NaN            28677716.0    17454.0
2  41331158.0       NaN            697771.0      8944.0
3  23446849.0       NaN            3042552.0     8318.0
4  16256611.0       389010.0       4981460.0     3622.0

   Tot Cases/ 1M pop  Deaths/ 1M pop  Total Tests  Tests/ 1M pop
0  52207.0            745.4           NaN          NaN
1  236596.0           2812.0          923990953.0  2765402.0
2  30343.0            362.0           747870047.0  533496.0
3  126171.0           2959.0          63776166.0   296647.0
4  326265.0           2049.0          238632994.0  3642923.0

As the output shows, there is a significant number of missing values. These are explored below, along with the data types of the attributes.

Country                     object
Population                  object
Total Cases                 int64
New Cases                   float64
Total Deaths                object
New Deaths                  float64
Total Recovered             float64
New Revovered               float64
Active Cases                float64
Serious, Critical Cases     float64
Tot Cases/ 1M pop           float64
Deaths/ 1M pop              float64
Total Tests                 float64
Tests/ 1M pop               float64


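As seen above, Population is detected as object, which is inappropriate for a count, so it needs to be converted to numeric. Total Deaths is also reported as object in this listing, because its values contain thousands separators (for example 939,427), and would need the same treatment before numeric processing. The remaining variables are detected appropriately. The percentage of missing values in each column is given below.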

Country                      0.000000
Population                   1.762115
Total Cases                  0.000000
New Cases                   82.378855
Total Deaths                 3.964758
New Deaths                  86.784141
Total Recovered              3.524229
New Revovered               83.700441
Active Cases                 3.524229
Serious, Critical Cases     26.872247
Tot Cases/ 1M pop            0.881057
Deaths/ 1M pop               4.845815
Total Tests                  7.929515
Tests/ 1M pop                7.929515

New cases, new deaths, and newly recovered each have more than 80% missing values. Removing instances listwise on these attributes would drastically reduce the size of the data, and imputing them would degrade its quality, so these attributes are removed from the DataFrame. The country attribute is also removed, as it is only an identifier and carries no information for clustering.

The population attribute is converted to numeric, with invalid values coerced to NaN. All missing (NaN) values in the DataFrame are then imputed with the column mean, since the mean is a simple imputation measure that preserves most of the characteristics of the dataset. After imputation there are no missing values in the DataFrame, as the counts below show.

Population                  0
Total Cases                 0
Total Deaths                0
Total Recovered             0
Active Cases                0
Serious, Critical Cases     0
Tot Cases/ 1M pop           0
Deaths/ 1M pop              0
Total Tests                 0
Tests/ 1M pop               0
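The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on a made-up toy table, not the original code: the values, the 80% cutoff expressed as a filter, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the COVID table; all values are invented for illustration.
df = pd.DataFrame({
    "Country": list("ABCDEF"),
    "Population": ["1000", "2000", "n/a", "4000", "5000", "6000"],  # text column
    "Total Cases": [10, 20, 30, 40, 50, 60],
    "New Cases": [np.nan, np.nan, np.nan, np.nan, np.nan, 5.0],     # mostly missing
    "Total Tests": [100.0, np.nan, 300.0, 400.0, 500.0, 600.0],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Columns with more than 80% missing values are dropped, along with the identifier.
sparse_cols = missing_pct[missing_pct > 80].index.tolist()
clean = df.drop(columns=["Country"] + sparse_cols)

# Convert the text column to numeric; unparseable entries become NaN.
clean["Population"] = pd.to_numeric(clean["Population"], errors="coerce")

# Impute every remaining NaN with the mean of its column.
clean = clean.fillna(clean.mean())

print(clean.isna().sum())
```

A column whose numbers contain thousands separators would first need them stripped (for example with `str.replace(",", "")`) before `pd.to_numeric` can parse it.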

The outliers in the feature attributes are explored with boxplots.

As seen from the boxplots, almost all the features except total deaths have outliers, so pre-processing is needed to remove them.

The outliers are defined by calculating quartiles. The lower bound, below which all values are outliers, is (1st quartile - 1.5 × IQR), and the upper bound, above which all values are outliers, is (3rd quartile + 1.5 × IQR), where IQR = inter-quartile range = Q3 - Q1. This function is applied to all columns, and the data without outliers is extracted by logical indexing (Dai and Chang 2021).
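The quartile rule can be illustrated with a short sketch. The helper name `iqr_mask` and the toy data are assumptions made for this example; the original function is not reproduced in the text.

```python
import pandas as pd

def iqr_mask(column, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    return column.between(q1 - k * iqr, q3 + k * iqr)

# Toy data: the value 100 in column "a" lies far above the upper bound.
data = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})

# Apply the rule to every column and keep rows that pass in all columns
# (logical indexing with a boolean mask).
keep = data.apply(iqr_mask).all(axis=1)
filtered = data[keep]

print(filtered)
```

Requiring a row to pass the test in every column is one possible reading of "applied for all columns"; it removes a row as soon as any single feature is an outlier.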

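The data is then normalized, because k-means and other distance-based unsupervised learning methods work better when all features share a common scale. Min-max normalization is performed on the outlier-filtered data with the scikit-learn module (Jo 2019). The min-max normalization formula is given by

x' = (x - min(x)) / (max(x) - min(x))

where min(x) and max(x) are the minimum and maximum of the feature, so every normalized value lies between 0 and 1.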
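As a small check of the min-max rule, the sketch below compares scikit-learn's `MinMaxScaler` with the formula written out by hand. The input array is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# scikit-learn rescales each column to the range [0, 1] by default.
scaled = MinMaxScaler().fit_transform(X)

# The same transformation written out column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
```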

Principal component analysis is now applied to reduce the dimensionality of the features.

PCA is applied to the whole normalized dataset, and the number of components needed to explain at least 95% of the total variation is determined; 7 components are found to be enough (Uddin, Mamun and Hossain 2021). The PCA transformation orders the output columns from the component with the highest explained variation to the lowest, so the first 7 components are extracted and used in the unsupervised algorithms.
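One way to pick the number of components from the cumulative explained variance is sketched below. The data is synthetic (a few latent factors mixed into 10 columns), so the number of components it yields is not the 7 reported for the COVID data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the normalized feature matrix (assumption):
# 4 latent factors mixed into 10 observed columns, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 4))
mixing = rng.normal(size=(4, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(150, 10))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.argmax(cumulative >= 0.95)) + 1

# Components come out ordered by explained variance, so keep the first ones.
X_reduced = pca.transform(X)[:, :n_components]

print(n_components, X_reduced.shape)
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which selects the number of components needed to explain that share of the variance.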

Optimal Number of K-means Clusters by the Within-Cluster Sum of Squares (WCSS):

When the k-means algorithm is applied to the PCA-extracted data with k from 1 to 10, the WCSS drops sharply from k=1 to k=2, and the further drops as k increases are not significant. Hence k=2 is taken as optimal for the dataset (Gárate-Escamila, El Hassani and Andrès 2020).
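The elbow procedure can be sketched as follows. The two-blob dataset is an assumption chosen so that the elbow at k=2 is easy to see; in scikit-learn the WCSS of a fitted model is exposed as `inertia_`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with two well-separated groups (assumption for illustration).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10]],
                  cluster_std=1.0, random_state=0)

# Fit k-means for k = 1..10 and record the within-cluster sum of squares.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# The elbow is where the decrease in WCSS levels off.
for k, value in enumerate(wcss, start=1):
    print(k, round(value, 1))
```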

Optimal Number of K-means Clusters by Silhouette Score:

The silhouette score is maximal at k=2, so the ideal number of country groups in the dataset is two, a result supported by both methods of choosing k (Jahwar and Abdulazeez 2020).
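A silhouette-based choice of k might look like the sketch below, again on synthetic two-blob data rather than the COVID features. The silhouette score is only defined for at least two clusters, so the loop starts at k=2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with two well-separated groups (assumption for illustration).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10]],
                  cluster_std=1.0, random_state=0)

# Average silhouette score for each candidate k.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette is taken as the best choice.
best_k = max(scores, key=scores.get)
print(best_k)
```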

In the plot of the k-means clusters on the first two principal components, the two clusters, indicated by circles of different colours, lie far apart and no overlap between points of different groups is observed, indicating that the clustering is very good (Yuan and Yang 2019). The length of the arrows shows the importance of each feature on the first two principal components: serious cases and population have significantly higher importance than the other four features (Sinaga and Yang 2020).

Agglomerative hierarchical clustering is now applied with the scikit-learn module, with the number of clusters left unspecified and the complete linkage method. Complete linkage is chosen over the other linkage methods because it uses the maximum distance between all observations of the two sets being merged, which tends to produce well-separated clusters. The distance threshold, the linkage distance at or above which clusters are not merged, is set to 0 so that the full merge tree is computed and every merge distance is available (Babichev et al. 2019). The dendrogram for these parameters is plotted with the dendrogram function.
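The setup described above can be sketched as follows, under the assumption that the dendrogram is drawn with SciPy from the fitted scikit-learn model. The random data and variable names are illustrative, and `no_plot=True` is used so the sketch runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

# Random stand-in for the PCA-reduced features (assumption).
rng = np.random.default_rng(0)
X = rng.random((20, 3))

# n_clusters=None with distance_threshold=0 makes scikit-learn build the
# full tree and store the distance of every merge in model.distances_.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0, linkage="complete"
).fit(X)

# Count the samples under each merge node to build a SciPy linkage matrix.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    count = 0
    for child in merge:
        # Indices below n_samples are leaves; others refer to earlier merges.
        count += 1 if child < n_samples else counts[child - n_samples]
    counts[i] = count

linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# Compute the dendrogram layout; drop no_plot=True to draw it with matplotlib.
tree = dendrogram(linkage_matrix, no_plot=True)
print(tree["leaves"])
```

With a threshold of 0 every sample ends up in its own cluster in `model.labels_`, so the number of groups is read off the dendrogram rather than from the labels.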

Dendrogram:

The dendrogram shows three differently coloured clusters, excluding the blue top cluster (which contains all of the data), so the optimal number of clusters by hierarchical clustering is 3 (Roux 2018).


Plot of Clusters and Features on PCA Axis:

The three clusters, shown by points of different colours, are not well separated: many points fall in the same region, so the quality of the clustering is not very good. However, the importance of the features on the first two PCA components is the same as found with k-means (Fauzi, Rustam and Wibowo 2021).

Of the two unsupervised learning approaches, k-means therefore gives the better grouping, with an optimal number of two clusters. The countries can be grouped into two groups of unequal size: cluster 1 contains 98 countries and cluster 0 contains 53 countries.

In Task 2, data on radio listeners is analysed to find associations among the music listened to by different users. The analysis can be used to recommend music to users as a focused marketing strategy, promoting music that users are likely to want to hear. The dataset, from a music community, records the music listened to by different users. It has four columns: user (the unique ID of a user), artist (the name of the artist), sex (the gender of the user) and country (the country of the user). The dataset is first transformed into an incidence matrix with one row per user and the artists listened to by that user as columns (Wen et al. 2019). The Apriori algorithm is then applied: the support of the artists is calculated, and rules satisfying minimum support, confidence and lift thresholds are mined. The first five rows of the dataset are shown below.

   user  artist                   sex  country
0  1     red hot chili peppers    f    Germany
1  1     the black dahlia murder  f    Germany
2  1     goldfrapp                f    Germany
3  1     dropkick murphys         f    Germany
4  1     le tigre                 f    Germany

Missing values by columns:

user       0
artist     0
sex        0
country    0

Hence, it can be seen that there are no missing values in any of the columns, and thus no filtering of instances is required.

        user  artist       sex  country
102441  6980  m.i.a.       f    United Kingdom
102446  6980  m.i.a.       f    United Kingdom
143737  9753  james brown  m    Germany
143746  9753  james brown  m    Germany

Number of unique users: 15000

Number of unique Artists: 1004

Number of unique Countries: 159


Distribution of sex:

m 0.730537

f 0.269463

Hence, the sample contains significantly more male users (about 73%) than female users (about 27%).

Number of records by country with bar plot:

Most of the listening records come from the United States, with a frequency close to 60000, followed by the United Kingdom, Germany and Poland, each of which also has a large frequency of over 10000.

The incidence matrix is now formed from the user and artist columns. It is represented as a list of lists, where each inner list corresponds to one user and its values are the artists that user listens to (Kurnia et al. 2019). There are therefore 15000 rows in the incidence matrix; the first 5 rows are displayed below.

[[‘red hot chili peppers’, ‘the black dahlia murder’, ‘goldfrapp’, ‘dropkick murphys’, ‘le tigre’, ‘schandmaul’, ‘edguy’, ‘jack johnson’, ‘eluveitie’, ‘the killers’, ‘judas priest’, ‘rob zombie’, ‘john mayer’, ‘the who’, ‘guano apes’, ‘the rolling stones’], [‘devendra banhart’, ‘boards of canada’, ‘cocorosie’, ‘aphex twin’, ‘animal collective’, ‘atmosphere’, ‘joanna newsom’, ‘air’, ‘portishead’, ‘massive attack’, ‘broken social scene’, ‘arcade fire’, ‘plaid’, ‘prefuse 73’, ‘m83’, ‘the flashbulb’, ‘pavement’, ‘goldfrapp’, ‘amon tobin’, ‘sage francis’, ‘four tet’, ‘max richter’, ‘autechre’, ‘radiohead’, ‘neutral milk hotel’, ‘beastie boys’, ‘aesop rock’, ‘mf doom’, ‘the books’], [‘tv on the radio’, ‘tool’, ‘kyuss’, ‘dj shadow’, ‘air’, ‘a tribe called quest’, ‘the cinematic orchestra’, ‘beck’, ‘bon iver’, ‘röyksopp’, ‘bonobo’, ‘the decemberists’, ‘snow patrol’, ‘battles’, ‘the prodigy’, ‘pink floyd’, ‘rjd2’, ‘the flaming lips’, ‘michael jackson’, ‘mgmt’, ‘the rolling stones’, ‘late of the pier’, ‘flight of the conchords’, ‘simian mobile disco’, ‘muse’, ‘fleetwood mac’, ‘led zeppelin’], [‘dream theater’, ‘ac/dc’, ‘metallica’, ‘iron maiden’, ‘bob marley & the wailers’, ‘megadeth’, ‘children of bodom’, ‘trivium’, ‘nightwish’, ‘sublime’, ‘volbeat’], [‘lily allen’, ‘kanye west’, ‘sigur rós’, ‘pink floyd’, ‘stevie wonder’, ‘metallica’, ‘thievery corporation’, ‘iron maiden’, ‘the streets’, ‘muse’, ‘faith no more’, ‘manu chao’, ‘tenacious d’, ‘depeche mode’, ‘justin timberlake’, ‘green day’, ‘snow patrol’, ‘dream theater’, ‘u2’, ‘jay-z’, ‘type o negative’, ‘pearl jam’, ‘queen’]]
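A list of lists of this kind can be built from a user/artist table roughly as sketched below. The toy records are invented, and dropping repeated user/artist pairs first is an assumption about how duplicate rows would be handled.

```python
import pandas as pd

# Toy listening records; user 1 has one repeated (user, artist) row.
plays = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "artist": ["goldfrapp", "air", "goldfrapp", "air", "radiohead"],
})

# Remove repeated pairs, then collect each user's artists into one list.
transactions = (
    plays.drop_duplicates(["user", "artist"])
         .groupby("user")["artist"]
         .apply(list)
         .tolist()
)

print(transactions)
```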

The associations among the artists listened to by different users are mined with the Apriori algorithm under different settings of minimum support, confidence, and lift. First, the top 3 users, those who listen to the highest number of artists, are identified from the incidence matrix by counting the length of each list in the list of lists and sorting the resulting DataFrame.


Index of top 3 users:

       User index  Number of artists
13459  13459       76
11494  11494       63
928    928         55

The top user listens to 76 artists, the 2nd user listens to 63 artists, and the 3rd user listens to 55 artists.
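The count-and-sort step can be sketched as below. The tiny transaction list is made up, and the column names simply mirror the output shown above.

```python
import pandas as pd

# Toy transactions: each inner list holds the artists of one user.
transactions = [
    ["a", "b", "c"],
    ["a"],
    ["a", "b"],
    ["a", "b", "c", "d"],
]

# Length of each inner list = number of artists for that user.
sizes = pd.DataFrame({
    "User index": range(len(transactions)),
    "Number of artists": [len(t) for t in transactions],
})

# Sort by basket size, largest first, and keep the first three rows.
top3 = sizes.sort_values("Number of artists", ascending=False).head(3)
print(top3)
```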

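The Apriori algorithm is used to mine rules among artists with minimum support = 0.03, minimum confidence = 0.4, and minimum lift = 2. These values were adjusted to obtain rules with meaningful confidence, and with the chosen parameters 7 reasonably high-confidence rules are mined, as given below (Rekik et al. 2018). The supports in the table are consistent with fractions of all 15000 users (for example, 0.032 × 15000 = 480 users), so the rules describe listening patterns across the whole user base.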

   Left Hand Side      Right Hand Side  Support   Confidence  Lift
3  led zeppelin        pink floyd       0.032000  0.404040    3.850449
0  bob dylan           the beatles      0.034467  0.497115    2.794877
5  sigur rós           radiohead        0.034200  0.492795    2.733702
6  the rolling stones  the beatles      0.030467  0.484110    2.721759
1  the killers         coldplay         0.041067  0.418194    2.637894
2  david bowie         the beatles      0.031733  0.430769    2.421866
4  led zeppelin        the beatles      0.033467  0.422559    2.375706

   Transaction
3  led zeppelin–pink floyd
0  bob dylan–the beatles
5  sigur rós–radiohead
6  the rolling stones–the beatles
1  the killers–coldplay
2  david bowie–the beatles
4  led zeppelin–the beatles

The rules in the output are sorted by lift. The highest-confidence rule is 'a user who listens to Bob Dylan has a probability of about 49.71% of also listening to the Beatles', and the other rules can be read in the same way. The lowest-confidence rule, which also has the highest lift, is 'a user who listens to Led Zeppelin has a probability of about 40% of also listening to Pink Floyd'. All 7 rules have reasonably high confidence of over 40% and can thus be used by radio companies to suggest music to users accordingly (Ghafari and Tjortjis 2019).
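To make the three measures concrete, the sketch below computes support, confidence and lift for one-to-one rules in plain Python. It is not the Apriori implementation used for the results above; the function name `pair_rules` and the toy baskets are assumptions for illustration only.

```python
from itertools import permutations

def pair_rules(transactions, min_support, min_confidence, min_lift):
    """Score every rule lhs -> rhs between single items and filter by thresholds."""
    n = len(transactions)
    item_count, pair_count = {}, {}
    for basket in (set(t) for t in transactions):
        for item in basket:
            item_count[item] = item_count.get(item, 0) + 1
        # Ordered pairs, so both a -> b and b -> a are counted.
        for lhs, rhs in permutations(sorted(basket), 2):
            pair_count[(lhs, rhs)] = pair_count.get((lhs, rhs), 0) + 1

    rules = []
    for (lhs, rhs), count in pair_count.items():
        support = count / n                       # share of baskets with both items
        confidence = count / item_count[lhs]      # P(rhs | lhs)
        lift = confidence / (item_count[rhs] / n) # confidence relative to P(rhs)
        if (support >= min_support and confidence >= min_confidence
                and lift >= min_lift):
            rules.append((lhs, rhs, support, confidence, lift))
    # Highest lift first, as in the table above.
    return sorted(rules, key=lambda r: r[4], reverse=True)

# Toy baskets: "a" and "b" co-occur, "c" is always alone.
baskets = [["a", "b"], ["a", "b"], ["a"], ["c"], ["c"], ["c"]]
rules = pair_rules(baskets, min_support=0.3, min_confidence=0.6, min_lift=1.5)
for rule in rules:
    print(rule)
```

Because lift is confidence divided by the support of the right-hand side, the table above implies a support of roughly 0.178 for the Beatles (0.497115 / 2.794877).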

Conclusion:

Clustering analysis of the COVID data shows that countries can be optimally grouped into two clusters of unequal size. From a government's point of view, countries in the same cluster have similar COVID characteristics, so there is no need to design a separate strategy for each of them to slow the spread of COVID or prevent the disease; one generalized strategy can be formed for all countries in the same group. After the strategy is employed, the results can be observed and, if needed, a few modifications can be made on a country-by-country basis. The clustering results can therefore help governments save time and cost when forming strategies to fight COVID.

The market basket analysis, applied with the Apriori algorithm to the listeners' baskets, yields 7 rules with reasonably strong confidence, which show an association between listening to different artists. Using these rules, the radio company can target users, starting with the top 3 listeners: when a user plays a song by one artist, songs by the associated artists can be suggested to improve the user's music experience. This is likely to increase user involvement with the radio and thus further develop the business of radio companies.

Babichev, S., Durnyak, B., Pikh, I. and Senkivskyy, V., 2019, May. An evaluation of the objective clustering inductive technology effectiveness implemented using density-based and agglomerative hierarchical clustering algorithms. In International Scientific Conference “Intellectual Systems of Decision Making and Problem of Computational Intelligence” (pp. 532-553). Springer, Cham.

Dai, Z. and Chang, X., 2021. Predicting Stock Return with Economic Constraint: Can Interquartile Range Truncate the Outliers? Mathematical Problems in Engineering, 2021.

Fauzi, I.R., Rustam, Z. and Wibowo, A., 2021. Multiclass classification of leukemia cancer data using Fuzzy Support Vector Machine (FSVM) with feature selection using Principal Component Analysis (PCA). In Journal of Physics: Conference Series (Vol. 1725, No. 1, p. 012012). IOP Publishing.

Gárate-Escamila, A.K., El Hassani, A.H. and Andrès, E., 2020. Classification models for heart disease prediction using feature selection and PCA. Informatics in Medicine Unlocked, 19, p.100330.

Garikapati, P., Balamurugan, K., Latchoumi, T.P. and Malkapuram, R., 2021. A Cluster-Profile Comparative Study on Machining AlSi7/63% of SiC Hybrid Composite Using Agglomerative Hierarchical Clustering and K-Means. Silicon, 13(4), pp.961-972.

Ghafari, S.M. and Tjortjis, C., 2019. A survey on association rules mining using heuristics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4), p.e1307.

Jahwar, A.F. and Abdulazeez, A.M., 2020. Meta-heuristic algorithms for K-means clustering: A review. PalArch’s Journal of Archaeology of Egypt/Egyptology, 17(7), pp.12002-12020.

Jo, J.M., 2019. Effectiveness of normalization pre-processing of big data to the machine learning performance. The Journal of the Korea institute of electronic communication sciences, 14(3), pp.547-552.

Kurnia, Y., Isharianto, Y., Giap, Y.C. and Hermawan, A., 2019, March. Study of application of data mining market basket analysis for knowing sales pattern (association of items) at the O! Fish restaurant using apriori algorithm. In Journal of Physics: Conference Series (Vol. 1175, No. 1, p. 012047). IOP Publishing.

Rekik, R., Kallel, I., Casillas, J. and Alimi, A.M., 2018. Assessing web sites quality: A systematic literature review by text and association rules mining. International journal of information management, 38(1), pp.201-216.

Roux, M., 2018. A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification, 35(2), pp.345-366.

Sinaga, K.P. and Yang, M.S., 2020. Unsupervised K-means clustering algorithm. IEEE access, 8, pp.80716-80727.

Uddin, M.P., Mamun, M.A. and Hossain, M.A., 2021. PCA-based feature reduction for hyperspectral remote sensing image classification. IETE Technical Review, 38(4), pp.377-396.

Wen, F., Zhang, G., Sun, L., Wang, X. and Xu, X., 2019. A hybrid temporal association rules mining method for traffic congestion prediction. Computers & Industrial Engineering, 130, pp.779-787.

Yuan, C. and Yang, H., 2019. Research on K-value selection method of K-means clustering algorithm. J, 2(2), pp.226-235.
