K-means clustering is a widely used method for solving clustering problems in Python. The algorithm divides an unlabelled dataset into k clusters so that each data point belongs to exactly one group of points with similar properties. Hierarchical clustering also groups unlabelled data into clusters and is similar in concept to k-means, but it builds a hierarchy of nested clusters rather than a single fixed partition. Both clustering methods are used in Task 1. In Task 2, market basket analysis is carried out with the Apriori algorithm.
This task applies unsupervised learning techniques, namely k-means clustering and hierarchical clustering, to group countries by their COVID statistics. The dataset has 14 variables: the first is the country name, and the remaining 13 are COVID-related attributes, namely population, total cases, new cases, total deaths, new deaths, total recovered, new recovered, active cases, serious or critical cases, total cases per 1 million population, deaths per 1 million population, total tests performed, and total tests per 1 million population. After the dataset is loaded, several pre-processing steps are applied to improve data quality. The important features are then extracted by Principal Component Analysis (Garikapati et al. 2021), and finally the unsupervised learning methods are applied to cluster the dataset into groups. The first five rows of the dataset are shown below.
Country Population Total Cases New Cases Total Deaths New Deaths
0 World NaN 406944092 638463.0 5810434 2654.0
1 USA 334125343 79052681 NaN 939,427 NaN
2 India 1401828285 42536137 NaN 507,208 NaN
3 Brazil 214990157 27125512 NaN 636,111 NaN
4 France 65505907 21372278 NaN 134,207 NaN
Total Recovered New Revovered Active Cases Serious, Critical Cases
0 327036011.0 915688.0 74097647.0 89283.0
1 49435538.0 NaN 28677716.0 17454.0
2 41331158.0 NaN 697771.0 8944.0
3 23446849.0 NaN 3042552.0 8318.0
4 16256611.0 389010.0 4981460.0 3622.0
Tot Cases/ 1M pop Deaths/ 1M pop Total Tests Tests/ 1M pop
0 52207.0 745.4 NaN NaN
1 236596.0 2812.0 923990953.0 2765402.0
2 30343.0 362.0 747870047.0 533496.0
3 126171.0 2959.0 63776166.0 296647.0
4 326265.0 2049.0 238632994.0 3642923.0
As seen from the output, there is a significant number of missing values. These are explored below, starting with the data types of the attributes.
Country object
Population object
Total Cases int64
New Cases float64
Total Deaths object
New Deaths float64
Total Recovered float64
New Revovered float64
Active Cases float64
Serious, Critical Cases float64
Tot Cases/ 1M pop float64
Deaths/ 1M pop float64
Total Tests float64
Tests/ 1M pop float64
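As seen above, Population is detected as an object type, which is inappropriate, so it needs to be converted to numeric. Total Deaths is also listed as object, most likely because its values contain thousands separators (for example 939,427). The remaining variables are detected with appropriate types. The percentage of missing values in each column is given below.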
Country 0.000000
Population 1.762115
Total Cases 0.000000
New Cases 82.378855
Total Deaths 3.964758
New Deaths 86.784141
Total Recovered 3.524229
New Revovered 83.700441
Active Cases 3.524229
Serious, Critical Cases 26.872247
Tot Cases/ 1M pop 0.881057
Deaths/ 1M pop 4.845815
Total Tests 7.929515
Tests/ 1M pop 7.929515
New cases, new deaths, and new recovered each have more than 80% missing values. Removing instances listwise on these attributes would drastically reduce the size of the data, while imputing them would degrade its quality. Hence, these three attributes are removed from the DataFrame. The country attribute is also removed, because it is only a label and carries no information for clustering.
The population attribute is converted to numeric, with invalid values coerced to NaN. Finally, all missing (NaN) values in the DataFrame are imputed with the column mean, since the mean is a simple imputation measure that preserves most of the characteristics of the dataset. After imputation, there are no missing values in the DataFrame, as shown below.
Population 0
Total Cases 0
Total Deaths 0
Total Recovered 0
Active Cases 0
Serious, Critical Cases 0
Tot Cases/ 1M pop 0
Deaths/ 1M pop 0
Total Tests 0
Tests/ 1M pop 0
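The cleaning steps described above can be sketched in pandas as follows. This is a minimal illustration on a small made-up DataFrame, not the original code: the toy values and the variable names (df, missing_pct, high_missing) are assumptions, and only the column names and the 80% cut-off are taken from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the COVID table (values are illustrative only).
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Population": ["1000", "2000", "N/A", "4000", "5000", "3000"],
    "Total Cases": [10, 20, 30, 40, 50, 60],
    "New Cases": [np.nan, np.nan, np.nan, np.nan, np.nan, 5.0],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Drop columns that are more than 80% missing, plus the country label.
high_missing = missing_pct[missing_pct > 80].index.tolist()
df = df.drop(columns=high_missing + ["Country"])

# Convert Population to numeric; unparseable entries become NaN.
df["Population"] = pd.to_numeric(df["Population"], errors="coerce")

# Impute the remaining gaps with the column mean.
df = df.fillna(df.mean())

print(df.isna().sum())
```

Mean imputation is sensitive to extreme values, so the median could be used instead if the columns are heavily skewed.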
The outliers in the feature attributes are explored by boxplots as given below.
As seen from the boxplots, all the features except total deaths have outliers, so pre-processing is needed to remove them.
Outliers are defined using quartiles. The lower bound, below which all values are outliers, is Q1 - 1.5 * IQR, and the upper bound, above which all values are outliers, is Q3 + 1.5 * IQR, where IQR = Q3 - Q1 is the inter-quartile range. This rule is applied to all columns, and the data without outliers is extracted by logical indexing (Dai and Chang 2021).
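The quartile rule can be sketched as a small helper. The function name remove_outliers_iqr and the toy data are hypothetical, and keeping a row only if it lies inside the bounds in every column is an assumption about how the logical indexing was applied.

```python
import pandas as pd

def remove_outliers_iqr(data: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] in every column."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Boolean mask: True where a row is inside the bounds for all columns.
    inside = ((data >= lower) & (data <= upper)).all(axis=1)
    return data[inside]

# Toy data: the value 100 in column "x" lies far above Q3 + 1.5 * IQR.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 11, 12, 13, 14]})
filtered = remove_outliers_iqr(toy)
print(filtered)
```

Filtering on all columns at once can remove many rows when several features are skewed, so it is worth checking how many instances remain afterwards.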
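The data is then normalized, because k-means and other distance-based unsupervised learning methods work best on features that share a common scale. Min-max normalization is performed on the outlier-filtered data with the scikit-learn module (Jo 2019). The min-max normalization formula is given by
x_scaled = (x - x_min) / (x_max - x_min)
where x_min and x_max are the minimum and maximum of the feature, so every feature is mapped to the range [0, 1].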
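As a small illustration of min-max scaling, the sketch below applies scikit-learn's MinMaxScaler to a toy array and compares the result with the column-wise computation (x - min) / (max - min). The array values and variable names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with two columns on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Scale each column to the range [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# The same result computed by hand, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```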
Principal component analysis is then applied to reduce the dimensionality of the features.
PCA is applied to the whole normalized dataset, and the number of components needed to explain at least 95% of the total variance is determined; 7 components are found to be enough (Uddin, Mamun and Hossain 2021). The PCA transformation returns the components ordered from highest to lowest explained variance, so the first 7 components are extracted and used as input to the unsupervised algorithms.
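One way to pick the number of components from the cumulative explained variance is sketched below. The data is synthetic (10 correlated features generated from 3 latent factors), so the number of components it selects will differ from the 7 reported for the COVID data; all variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 150 rows, 10 correlated features (illustrative only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(150, 10))

# Fit PCA with all components; scores are ordered by explained variance.
pca = PCA()
scores = pca.fit_transform(X)

# Smallest number of components whose cumulative variance reaches 95%.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.95)) + 1

# Keep only the leading components for clustering.
X_reduced = scores[:, :n_components]
print(n_components, cum_var[n_components - 1])
```

scikit-learn also accepts a fraction directly, as in PCA(n_components=0.95), which selects the number of components in the same way.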
Optimal number of K-means clusters by the within-cluster sum of squares (WCSS):
When the k-means algorithm is applied to the PCA-extracted data with k from 1 to 10, the WCSS drops sharply from k=1 to k=2, and the further decreases as k increases are not significant. By this elbow criterion, k=2 is optimal for the dataset (Gárate-Escamila, El Hassani and Andrès 2020).
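The WCSS values behind an elbow plot can be collected as in the sketch below. It uses two synthetic, well-separated blobs rather than the PCA-transformed COVID data, and the variable names are assumptions; in scikit-learn the WCSS of a fitted model is exposed as inertia_.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs standing in for the PCA-reduced data.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])

# Fit k-means for k = 1..10 and record the within-cluster sum of squares.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

for k, value in enumerate(wcss, start=1):
    print(k, round(value, 2))
```

Plotting wcss against k (for example with matplotlib) gives the elbow curve; the elbow is the value of k after which the decrease flattens out.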
Optimal Number of K-means Clusters by Silhouette Score:
The silhouette score is highest for k=2. Both methods of selecting k therefore agree that the countries in the dataset are best divided into two groups (Jahwar and Abdulazeez 2020).
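A silhouette-based choice of k can be sketched in the same way. The data below is again two synthetic blobs, not the COVID data, and the names are assumptions. The silhouette score is only defined for at least two clusters, so the loop starts at k=2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two synthetic blobs standing in for the PCA-reduced data.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])

# Mean silhouette score for each candidate k.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is taken as the optimum.
best_k = max(scores, key=scores.get)
print(best_k)
```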
In the plot of clusters and features on the PCA axes, the two clusters, indicated by differently coloured circles, are far apart and no overlap between points of different groups is observed, which indicates that the clustering is very good (Yuan and Yang 2019). The length of the arrows shows the importance of each feature on the first two principal components. Serious cases and population have significantly higher importance than the other four features (Sinaga and Yang 2020).
Agglomerative hierarchical clustering is now applied with the scikit-learn module, with an unspecified number of clusters and the complete linkage method. Complete linkage is chosen over the other linkage methods because it uses the maximum distance between all observations in the two clusters being merged, which tends to produce well-separated clusters. The distance threshold is set to 0 so that no clusters are merged in the fitted labelling and the full merge tree is computed (Babichev et al. 2019). The dendrogram for hierarchical clustering with these parameters is plotted with the dendrogram function, as given below.
Dendrogram:
The dendrogram shows three differently coloured clusters, excluding the blue top-level cluster, which contains all of the data. Hence, the optimal number of clusters by hierarchical clustering is 3 (Roux 2018).
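A dendrogram for this setup can be built roughly as sketched below, following the common pattern of converting the fitted scikit-learn model into a SciPy linkage matrix. The synthetic three-blob data and the variable names are assumptions, and no_plot=True is used so the sketch runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

# Three synthetic blobs of 10 points each (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in (0, 4, 8)])

# distance_threshold=0 with n_clusters=None computes the full merge tree.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0, linkage="complete"
).fit(X)

# Count the samples under each merge node, as SciPy's format requires.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    for child in merge:
        counts[i] += 1 if child < n_samples else counts[child - n_samples]

# Linkage matrix columns: child 1, child 2, merge distance, sample count.
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]
).astype(float)

# Compute the dendrogram layout without drawing it.
tree = dendrogram(linkage_matrix, no_plot=True)
print(len(tree["leaves"]))
```

Removing no_plot=True draws the tree with matplotlib; the colours in the drawing come from SciPy's default colour threshold, so the number of coloured branches is a visual guide rather than a formal criterion.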
Plot of Clusters and Features on PCA Axis:
The three clusters, shown by differently coloured points, are not well separated: many points fall in the same region, so the quality of the clustering is not very good. However, the importance of the features on the first two PCA components is the same as found with k-means (Fauzi, Rustam and Wibowo 2021).
Hence, of the two unsupervised learning approaches, k-means gives the better grouping, with two as the optimal number of clusters. The countries can therefore be divided into two groups of unequal size: the cluster labelled 1 contains 98 countries and the cluster labelled 0 contains 53 countries.
In this task, data on radio listeners is analysed to find associations among the artists listened to by different users. The analysis can be used to recommend music to users as a focused marketing strategy, which in turn promotes music that users are likely to want to hear. The dataset comes from a music community and records the music listened to by different users. It has four columns: user (the unique ID of a user), artist (the name of the artist), sex (the gender of the user), and country (the country of the user). The dataset is first transformed into an incidence matrix with one row per user and the artists listened to by that user as columns (Wen et al. 2019). The Apriori algorithm is then applied: the support of the artists is calculated, and rules that satisfy minimum support, confidence, and lift thresholds are mined. The first five rows of the dataset are shown below.
user artist sex country
0 1 red hot chili peppers f Germany
1 1 the black dahlia murder f Germany
2 1 goldfrapp f Germany
3 1 dropkick murphys f Germany
4 1 le tigre f Germany
Missing values by columns:
user 0
artist 0
sex 0
country 0
Hence, it can be seen that there are no missing values in any of the columns, and thus no filtering of instances is required.
user artist sex country
102441 6980 m.i.a. f United Kingdom
102446 6980 m.i.a. f United Kingdom
143737 9753 james brown m Germany
143746 9753 james brown m Germany
Number of unique users: 15000
Number of unique Artists: 1004
Number of unique Countries: 159
Distribution of sex:
m 0.730537
f 0.269463
Hence, the sample contains significantly more male users (about 73%) than female users (about 27%).
Number of listening records by country with bar plot:
Most of the music is listened to by users from the United States, with a frequency close to 60000, followed by the United Kingdom, Germany, and Poland, each of which also has a large frequency of over 10000.
Now, the incidence matrix is formed from the user and artist columns. It is represented as a list of lists, where each inner list corresponds to one user and its values are the artists listened to by that user (Kurnia et al. 2019). Hence, there are 15000 rows in the incidence matrix, of which the first 5 are displayed below.
[[‘red hot chili peppers’, ‘the black dahlia murder’, ‘goldfrapp’, ‘dropkick murphys’, ‘le tigre’, ‘schandmaul’, ‘edguy’, ‘jack johnson’, ‘eluveitie’, ‘the killers’, ‘judas priest’, ‘rob zombie’, ‘john mayer’, ‘the who’, ‘guano apes’, ‘the rolling stones’], [‘devendra banhart’, ‘boards of canada’, ‘cocorosie’, ‘aphex twin’, ‘animal collective’, ‘atmosphere’, ‘joanna newsom’, ‘air’, ‘portishead’, ‘massive attack’, ‘broken social scene’, ‘arcade fire’, ‘plaid’, ‘prefuse 73’, ‘m83’, ‘the flashbulb’, ‘pavement’, ‘goldfrapp’, ‘amon tobin’, ‘sage francis’, ‘four tet’, ‘max richter’, ‘autechre’, ‘radiohead’, ‘neutral milk hotel’, ‘beastie boys’, ‘aesop rock’, ‘mf doom’, ‘the books’], [‘tv on the radio’, ‘tool’, ‘kyuss’, ‘dj shadow’, ‘air’, ‘a tribe called quest’, ‘the cinematic orchestra’, ‘beck’, ‘bon iver’, ‘röyksopp’, ‘bonobo’, ‘the decemberists’, ‘snow patrol’, ‘battles’, ‘the prodigy’, ‘pink floyd’, ‘rjd2’, ‘the flaming lips’, ‘michael jackson’, ‘mgmt’, ‘the rolling stones’, ‘late of the pier’, ‘flight of the conchords’, ‘simian mobile disco’, ‘muse’, ‘fleetwood mac’, ‘led zeppelin’], [‘dream theater’, ‘ac/dc’, ‘metallica’, ‘iron maiden’, ‘bob marley & the wailers’, ‘megadeth’, ‘children of bodom’, ‘trivium’, ‘nightwish’, ‘sublime’, ‘volbeat’], [‘lily allen’, ‘kanye west’, ‘sigur rós’, ‘pink floyd’, ‘stevie wonder’, ‘metallica’, ‘thievery corporation’, ‘iron maiden’, ‘the streets’, ‘muse’, ‘faith no more’, ‘manu chao’, ‘tenacious d’, ‘depeche mode’, ‘justin timberlake’, ‘green day’, ‘snow patrol’, ‘dream theater’, ‘u2’, ‘jay-z’, ‘type o negative’, ‘pearl jam’, ‘queen’]]
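A list of lists of this form can be produced from the user and artist columns with a group-by, as in the sketch below. The toy rows and the names plays and baskets are assumptions; only the column names come from the dataset.

```python
import pandas as pd

# Toy listening records with the same two columns as the dataset.
plays = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "artist": [
        "red hot chili peppers", "goldfrapp", "le tigre",
        "air", "goldfrapp", "metallica",
    ],
})

# One inner list of artists per user, in order of user ID.
baskets = plays.groupby("user")["artist"].apply(list).tolist()
print(baskets)
```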
The associations among the artists listened to by different users are then mined with the Apriori algorithm under different settings of minimum support, confidence, and lift. Before that, the top 3 users, that is, those who listen to the highest number of artists, are identified from the incidence matrix by counting the length of each inner list and then sorting the resulting DataFrame.
Index of top 3 users:
User index Number of artists
13459 13459 76
11494 11494 63
928 928 55
The top user listens to a total of 76 artists, the second user listens to 63 artists, and the third user listens to 55 artists.
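The ranking of users by basket size can be sketched as follows. The toy baskets and variable names are made up; the column labels mirror those in the output above.

```python
import pandas as pd

# Toy list of lists: one inner list of artists per user.
baskets = [
    ["a1", "a2", "a3"],
    ["a1"],
    ["a1", "a2", "a3", "a4", "a5"],
    ["a2", "a4"],
]

# Count the artists in each basket and keep the three largest.
counts = pd.DataFrame({
    "User index": range(len(baskets)),
    "Number of artists": [len(b) for b in baskets],
})
top3 = counts.sort_values("Number of artists", ascending=False).head(3)
print(top3)
```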
The Apriori algorithm is used to mine rules among artists for the top users with minimum support = 0.03, minimum confidence = 0.4, and minimum lift = 2. These values were adjusted to obtain mined rules with meaningful confidence, and with the chosen parameters, 7 reasonably high-confidence rules are mined, as given below in decreasing order of lift (Rekik et al. 2018).
Left Hand Side Right Hand Side Support Confidence Lift
3 led zeppelin pink floyd 0.032000 0.404040 3.850449
0 bob dylan the beatles 0.034467 0.497115 2.794877
5 sigur rós radiohead 0.034200 0.492795 2.733702
6 the rolling stones the beatles 0.030467 0.484110 2.721759
1 the killers coldplay 0.041067 0.418194 2.637894
2 david bowie the beatles 0.031733 0.430769 2.421866
4 led zeppelin the beatles 0.033467 0.422559 2.375706
Transaction
3 led zeppelin–pink floyd
0 bob dylan–the beatles
5 sigur rós–radiohead
6 the rolling stones–the beatles
1 the killers–coldplay
2 david bowie–the beatles
4 led zeppelin–the beatles
It can be seen from the output that the highest-confidence rule is 'A user who listens to Bob Dylan has a probability of about 49.71% of also listening to the Beatles', and the other rules can be read similarly. The lowest-confidence rule is 'A user who listens to Led Zeppelin has a probability of about 40.4% of also listening to Pink Floyd', although this rule has the highest lift (3.85). All 7 rules have a reasonably high confidence of over 40% and can therefore be used by radio companies to suggest music to users (Ghafari and Tjortjis 2019).
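To make the three rule measures concrete, the sketch below computes support, confidence, and lift for single-artist rules (one artist on each side) in plain Python. It is a simplified illustration, not the Apriori implementation used for the results above: it enumerates pairs directly without Apriori's candidate pruning, and the function name pair_rules, the toy baskets, and the thresholds are assumptions. For a rule A -> B, support is the share of baskets containing both artists, confidence is support(A and B) / support(A), and lift is confidence / support(B).

```python
from collections import Counter
from itertools import permutations

def pair_rules(baskets, min_support, min_confidence, min_lift):
    """Return (lhs, rhs, support, confidence, lift) for single-item rules."""
    n = len(baskets)
    sets = [set(b) for b in baskets]
    # Number of baskets containing each artist and each ordered pair.
    item_count = Counter(item for s in sets for item in s)
    pair_count = Counter()
    for s in sets:
        for lhs, rhs in permutations(sorted(s), 2):
            pair_count[(lhs, rhs)] += 1
    rules = []
    for (lhs, rhs), both in pair_count.items():
        support = both / n
        confidence = both / item_count[lhs]
        lift = confidence / (item_count[rhs] / n)
        if (support >= min_support and confidence >= min_confidence
                and lift >= min_lift):
            rules.append((lhs, rhs, support, confidence, lift))
    # Sort by lift, highest first.
    return sorted(rules, key=lambda r: r[4], reverse=True)

# Toy baskets (illustrative only).
baskets = [
    ["bob dylan", "the beatles"],
    ["bob dylan", "the beatles"],
    ["bob dylan"],
    ["the beatles"],
    ["radiohead"], ["radiohead"], ["radiohead"], ["radiohead"],
]
rules = pair_rules(baskets, min_support=0.2, min_confidence=0.4, min_lift=1.5)
for rule in rules:
    print(rule)
```

The same relation can be used to check the mined table: dividing confidence by lift for the Bob Dylan rule (0.497115 / 2.794877) gives about 0.178, so the Beatles appear in roughly 17.8% of all baskets.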
Conclusion:
Hence, the clustering analysis of the COVID data shows that the countries can be optimally grouped into two clusters of unequal size. From a government's point of view, countries that fall in the same cluster have similar COVID characteristics, so there is no need to form a separate strategy for each country in a group to slow the spread of COVID or prevent the disease; instead, one generalized strategy can be formed for all countries in the same group. After the strategy is employed, the results can be observed and, if needed, a few modifications can be made on a country-by-country basis. The clustering results can thus save governments time and cost, which in turn helps in forming good strategies to fight COVID.
Market basket analysis with the Apriori algorithm, applied to the top baskets (the top 3 listeners), yields 7 rules with strong confidence, which show notable associations between listening to different artists. Using these rules, the radio company can target the top 3 users: when a user plays a song by one artist, songs by the associated artists can be suggested to improve the user's music experience. This is likely to increase user engagement with the radio service and thus further develop the business of radio companies.
Babichev, S., Durnyak, B., Pikh, I. and Senkivskyy, V., 2019, May. An evaluation of the objective clustering inductive technology effectiveness implemented using density-based and agglomerative hierarchical clustering algorithms. In International Scientific Conference “Intellectual Systems of Decision Making and Problem of Computational Intelligence” (pp. 532-553). Springer, Cham.
Dai, Z. and Chang, X., 2021. Predicting Stock Return with Economic Constraint: Can Interquartile Range Truncate the Outliers?. Mathematical Problems in Engineering, 2021.
Fauzi, I.R., Rustam, Z. and Wibowo, A., 2021. Multiclass classification of leukemia cancer data using Fuzzy Support Vector Machine (FSVM) with feature selection using Principal Component Analysis (PCA). In Journal of Physics: Conference Series (Vol. 1725, No. 1, p. 012012). IOP Publishing.
Gárate-Escamila, A.K., El Hassani, A.H. and Andrès, E., 2020. Classification models for heart disease prediction using feature selection and PCA. Informatics in Medicine Unlocked, 19, p.100330.
Garikapati, P., Balamurugan, K., Latchoumi, T.P. and Malkapuram, R., 2021. A Cluster-Profile Comparative Study on Machining AlSi7/63% of SiC Hybrid Composite Using Agglomerative Hierarchical Clustering and K-Means. Silicon, 13(4), pp.961-972.
Ghafari, S.M. and Tjortjis, C., 2019. A survey on association rules mining using heuristics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4), p.e1307.
Jahwar, A.F. and Abdulazeez, A.M., 2020. Meta-heuristic algorithms for K-means clustering: A review. PalArch’s Journal of Archaeology of Egypt/Egyptology, 17(7), pp.12002-12020.
Jo, J.M., 2019. Effectiveness of normalization pre-processing of big data to the machine learning performance. The Journal of the Korea institute of electronic communication sciences, 14(3), pp.547-552.
Kurnia, Y., Isharianto, Y., Giap, Y.C. and Hermawan, A., 2019, March. Study of application of data mining market basket analysis for knowing sales pattern (association of items) at the O! Fish restaurant using apriori algorithm. In Journal of Physics: Conference Series (Vol. 1175, No. 1, p. 012047). IOP Publishing.
Rekik, R., Kallel, I., Casillas, J. and Alimi, A.M., 2018. Assessing web sites quality: A systematic literature review by text and association rules mining. International journal of information management, 38(1), pp.201-216.
Roux, M., 2018. A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification, 35(2), pp.345-366.
Sinaga, K.P. and Yang, M.S., 2020. Unsupervised K-means clustering algorithm. IEEE access, 8, pp.80716-80727.
Uddin, M.P., Mamun, M.A. and Hossain, M.A., 2021. PCA-based feature reduction for hyperspectral remote sensing image classification. IETE Technical Review, 38(4), pp.377-396.
Wen, F., Zhang, G., Sun, L., Wang, X. and Xu, X., 2019. A hybrid temporal association rules mining method for traffic congestion prediction. Computers & Industrial Engineering, 130, pp.779-787.
Yuan, C. and Yang, H., 2019. Research on K-value selection method of K-means clustering algorithm. J, 2(2), pp.226-235.