The Covid-19 pandemic has had a severe impact on the economies of countries worldwide, lowering both GDP and GDP per capita. Georgiou et al. (2020) noted that the Covid-19 outbreak and pandemic changed every aspect of human activity to an exceptional extent. Hutagalung et al. (2021) reported that the numbers of confirmed cases and deaths due to Covid-19 were increasing in Southeast Asia; they applied K-means clustering to deaths data collected from the WHO and divided the data into three clusters: “High”, “Medium”, and “Low”.
The aim of this paper is to understand and analyze the association between the severity of the Covid-19 pandemic and the World Development Indicators (WDI). Secondary data was collected from the World Bank database. The dataset requires preprocessing before the clustering, classification, and regression algorithms can be applied. Clustering was performed to understand how countries group on the WDI indicators and to compare the aggregate results of the clusters. Logistic regression, LDA, QDA, and multivariate (multiclass) logistic regression were used in RStudio to classify Covid-19 deaths into two and four categories.
The dataset includes 20 variables and 186 records, each with a unique country name.
The data contains many missing values as well as wrongly typed variables. The variables covid.deaths and comp.education are numerical but were stored as strings, so their data type was changed from character to numeric in R (see Appendix Image 1.0). Missing values in variables without outliers were filled with the variable's mean. The variables “Covid.deaths”, “Life.expec”, “Birth.rate”, and “Water.services” have no outliers.
Fig 1.0: Boxplot of Variables with No outliers
For variables with outliers, missing values were filled with the variable's median. All the remaining variables (“Elect.access”, “Pop.growth”, “Pop.density”, etc.) contain outliers, so their missing values were treated using the median.
Fig 1.1: Boxplot with Outliers
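The imputation rule described above can be sketched as follows. The paper's analysis was done in R; this Python version is only an illustration. It assumes that "outlier" means a point outside the 1.5 × IQR boxplot whiskers, and the function names and toy values are hypothetical.

```python
from statistics import mean, median, quantiles

def has_outliers(values):
    """Boxplot rule (assumed): any point beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return any(v < low or v > high for v in values)

def impute(column):
    """Fill None with the mean if the column has no outliers, else the median."""
    observed = [v for v in column if v is not None]
    fill = median(observed) if has_outliers(observed) else mean(observed)
    return [fill if v is None else v for v in column]

# Toy columns, not the real World Bank data.
life_expec = [70.0, 72.0, None, 74.0, 76.0]            # no outliers: mean is used
pop_density = [10.0, 12.0, 11.0, None, 13.0, 500.0]    # one outlier: median is used

life_filled = impute(life_expec)
density_filled = impute(pop_density)
```

The median is preferred when outliers are present because a single extreme value (500 in the toy column) would pull the mean far away from the typical country.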
Outliers can distort a model, so they were treated using the capping method. Capping at the 10% level (called the level of significance in this analysis) treated some of the outliers. Extending the range from 10% to 15% would treat almost all of them, although some variables would still show a few outlying values; in that case those values are not true outliers but belong to a different group (cluster) of countries. Raising the level further is undesirable because it increases the error introduced into the data, so the level was kept at a maximum of 10%.
The correlation plot shows the pairwise relationships between all the variables. Covid deaths are moderately to highly negatively correlated with “Mortality Rate”, “Population growth”, and “Birth Rate”, and moderately to highly positively correlated with “Life.expec”, “Elect.access”, “Primary”, and “water services”. “Mortality rate” and “Life.expec” are highly negatively correlated (refer to Fig 1.2).
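One common form of capping replaces values below a lower percentile and above an upper percentile with those percentile values. The sketch below assumes that the 10% level means capping at the 10th and 90th percentiles; the text does not state this, so it is an assumption. The function name and data are hypothetical.

```python
from statistics import quantiles

def cap_outliers(values, level=0.10):
    """Clamp values to the [level, 1 - level] percentile range (assumed rule)."""
    pct = round(level * 100)
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    lower, upper = cuts[pct - 1], cuts[100 - pct - 1]
    return [min(max(v, lower), upper) for v in values]

# Toy column with one extreme value, not the real data.
pop_density = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 1000.0]
capped = cap_outliers(pop_density, level=0.10)
```

A wider level such as 0.15 would modify more observations, which is the trade-off the paragraph above describes.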
Fig 1.2: Correlation Plot
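Each cell of a correlation plot is typically a Pearson correlation coefficient. The following minimal sketch computes one such coefficient; it assumes Pearson correlation was used (the text does not name the method), and the values are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy values: higher mortality rate paired with lower life expectancy.
mortality_rate = [5.0, 10.0, 20.0, 40.0, 60.0]
life_expec = [82.0, 78.0, 72.0, 65.0, 58.0]
r = pearson(mortality_rate, life_expec)  # strongly negative for this toy data
```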
K-means clustering is an unsupervised machine learning algorithm and a data mining technique. It partitions the data into “K” clusters. The value of K is chosen using the gap statistic: the optimal number of clusters is taken as K. As per Fig 1.3, K = 7.
Fig 1.3: Optimal Number of Clusters
Seven clusters were built, and the 186 countries were divided among them (refer to Fig 1.4).
Fig 1.4: K-means Cluster Plot
The clusters were built using the Euclidean distance. Because the variables are measured in different units, the preprocessed data was first transformed to scaled data so that no variable dominates the distance computation during clustering.
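The scaling and clustering steps can be sketched as below. This is an illustrative Python version of Lloyd's K-means algorithm, which may differ from the variant used by the R function in the original analysis. The initialisation with the first k points, the two toy variables, and K = 2 are all assumptions made to keep the example small.

```python
from math import dist
from statistics import mean, stdev

def scale(rows):
    """Z-score each column (mean 0, sample standard deviation 1)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with Euclidean distance; naive init from the first k points."""
    centroids = list(points[:k])
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(mean(col) for col in zip(*members)) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

# Toy rows of (life expectancy, electricity access %), not the real data.
rows = [(80.0, 100.0), (55.0, 30.0), (78.0, 99.0), (57.0, 35.0), (79.0, 98.0), (54.0, 28.0)]
labels, centroids = kmeans(scale(rows), k=2)
```

Without scaling, electricity access (range of about 70 points here) would contribute far more to the distance than life expectancy (range of about 25 years).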
Fig 1.5: Aggregate of K-means Clusters
As per Fig 1.5, the average number of Covid-19 deaths is highest in cluster 4 and lowest in cluster 5. Average life expectancy is highest in cluster 2 and lowest in cluster 5. Average electricity access is highest in clusters 2 and 3 and lowest in cluster 5. Fig 1.5 thus gives the average of each variable for each cluster, with every country assigned to exactly one cluster. Characteristics are homogeneous within a cluster and heterogeneous between clusters.
Logistic regression is a supervised machine learning algorithm; it is a classifier for a target (dependent) variable with two categories (a binary variable). The covid.deaths variable was transformed into a binary variable using its mean: if the covid.deaths value is less than 292 it is categorized as “Low” (low number of deaths), and if it is greater than or equal to 292 it is categorized as “High” (high number of deaths) (refer to Table 1.0). The categories “Low” and “High” were encoded as 0 and 1 using a label encoder in R.
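An aggregate table of this kind is the mean of each variable grouped by cluster label. The sketch below shows the computation for one variable; the cluster labels and values are invented for illustration and are not taken from Fig 1.5.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(values, labels):
    """Average of one variable within each cluster."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: mean(vals) for label, vals in groups.items()}

# Toy data: five countries assigned to two hypothetical clusters.
covid_deaths = [300.0, 50.0, 320.0, 70.0, 310.0]
labels = [4, 5, 4, 5, 4]

means = cluster_means(covid_deaths, labels)
highest = max(means, key=means.get)   # cluster with the largest average
lowest = min(means, key=means.get)    # cluster with the smallest average
```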
Table 1.0: Statistics of Covid.deaths Variable
| Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
| --- | --- | --- | --- | --- | --- |
| 46 | 167.5 | 307 | 292.3 | 307 | 669 |
The data was divided into 70% training data and 30% test data; the logistic model was trained on the 70% and tested on the remaining 30%. Accuracy on the training data is 82.31%, while accuracy on the test data is 85.71% (refer to Appendix Image 1.1).
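The binary workflow (threshold at 292, 70/30 split, fit, accuracy) can be sketched as follows. This is a Python illustration on synthetic data, not the R code or the World Bank data used in the paper. The single predictor, the gradient-descent fit, and all names are assumptions, so the accuracies it produces are unrelated to the figures reported above.

```python
import math
import random

THRESHOLD = 292  # mean-based cut-off for covid.deaths quoted in the text

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss; returns [intercept, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, xi):
    """Class 1 ("High") when the linear score is non-negative, else 0 ("Low")."""
    return 1 if w[0] + sum(a * b for a, b in zip(w[1:], xi)) >= 0 else 0

def accuracy(w, X, y):
    return sum(predict(w, xi) == yi for xi, yi in zip(X, y)) / len(y)

# Synthetic stand-in for the 186 countries: one scaled predictor per row.
rng = random.Random(0)
X = [[rng.uniform(-2, 2)] for _ in range(186)]
deaths = [292 + 100 * row[0] + rng.gauss(0, 50) for row in X]
y = [1 if d >= THRESHOLD else 0 for d in deaths]  # 1 = "High", 0 = "Low"

# 70/30 split on shuffled row indices.
idx = list(range(len(X)))
rng.shuffle(idx)
cut = int(0.7 * len(idx))
train, test = idx[:cut], idx[cut:]

w = fit_logistic([X[i] for i in train], [y[i] for i in train])
train_acc = accuracy(w, [X[i] for i in train], [y[i] for i in train])
test_acc = accuracy(w, [X[i] for i in test], [y[i] for i in test])
```

With 186 records, a 70/30 split yields 130 training rows and 56 test rows.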
Multiclass algorithms, referred to in this paper as multivariate algorithms, classify a target (dependent) variable that has more than two categories. LDA, QDA, and logistic regression were used as multiclass algorithms. Four categories were created from the summary statistics of the covid.deaths variable (Table 1.0): a value less than or equal to 46 is categorized as “Excessively Low” (excessively low number of deaths); a value greater than 46 and less than or equal to 167 is categorized as “Low” (low number of deaths); a value greater than 167 and less than or equal to 307 is categorized as “High” (high number of deaths); and a value greater than 307 is categorized as “Excessively High” (excessively high number of deaths). For the LDA algorithm this variable was encoded with a label encoder: 0: “High”, 1: “Low”, 2: “Excess High”, 3: “Excess Low”.
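The four-way categorisation and the label encoding can be written as a small function. The cut-offs (46, 167, 307) and the integer codes are taken from the text above; the function and variable names are hypothetical.

```python
def to_four_classes(deaths):
    """Map a covid.deaths value to one of the four categories defined in the text."""
    if deaths <= 46:
        return "Excess Low"
    if deaths <= 167:
        return "Low"
    if deaths <= 307:
        return "High"
    return "Excess High"

# Integer codes quoted in the text for the LDA model.
LABEL_CODES = {"High": 0, "Low": 1, "Excess High": 2, "Excess Low": 3}

sample = [46, 120, 167.5, 307, 669]
categories = [to_four_classes(v) for v in sample]
encoded = [LABEL_CODES[c] for c in categories]
```

Note that the integer codes carry no order, so they serve only as class identifiers.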
For the QDA and logistic regression algorithms, four dummy variables named “Excess_High”, “High”, “low”, and “Excess_Low” were created from the dependent variable (Covid.deaths.cat4). In this analysis QDA (quadratic discriminant analysis) could not be fitted directly because the defined categories contain unequal numbers of samples; to avoid the error, the dummy variables were used and four separate QDA models were built, one per category (refer to Appendix). LDA (linear discriminant analysis) is a multiclass algorithm that did not show this issue, so a single LDA model was built with all four categories in the dependent variable.
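Building one binary model per category is a one-vs-rest scheme: each dummy variable is 1 for its own category and 0 for every other category. A minimal sketch follows; the dummy names come from the text, while the function name and sample labels are hypothetical.

```python
def one_vs_rest(labels, classes):
    """Return one 0/1 indicator column per class."""
    return {c: [1 if lab == c else 0 for lab in labels] for c in classes}

CLASSES = ["Excess_High", "High", "low", "Excess_Low"]

# Toy category column standing in for Covid.deaths.cat4.
cat4 = ["High", "low", "Excess_High", "High", "Excess_Low"]
dummies = one_vs_rest(cat4, CLASSES)
# Each dummies[name] is then the binary target for one separate model.
```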
Table 1.1 reports the train and test accuracy of each multiclass model.
Table 1.1: Accuracy of Multiclass Algorithms
| Models | Train.Accuracy | Test.Accuracy |
| --- | --- | --- |
| LDA | 72.31% | 60.71% |
| QDA_ExcessHigh | 90% | 71.43% |
| QDA_High | 89.23% | 80.36% |
| QDA_ExcessLow_Low | 80.77% | 80.36% |
| QDA_LOW | 98.46% | 89.29% |
| LogisticRegression_ExcessHigh | 81.54% | 73.21% |
| LogisticRegression_High | 80.77% | 73.21% |
| LogisticRegression_ExcessLow | 93.08% | 87.5% |
| LogisticRegression_Low | 92.31% | 78.57% |
The result of the McNemar test is displayed in the confusion matrix report. It is a non-parametric test for paired categorical data that uses a chi-square statistic. Applied to a confusion matrix, the null hypothesis is “there is no significant difference between the predicted and observed proportions of Low and High numbers of deaths across the countries”, and the alternative hypothesis is “there is a significant difference between the predicted and observed proportions of Low and High numbers of deaths across the countries”. The p-values of the test for the train and test data are 0.4042 and 0.28884 respectively, both much greater than 0.05, so the test fails to reject the null hypothesis and no significant difference is found. In the model, only the coefficients of “Intercept”, “GDP capita”, “Population total”, and “Birth rate” are significant, because their p-values are less than 0.05, which leads to rejection of the null hypothesis that the coefficient is zero; the coefficients of the remaining variables are not significant.
The results of the McNemar test for every category and algorithm are displayed in the confusion matrix reports (refer to Appendix).
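For a 2 × 2 confusion matrix, McNemar's statistic depends only on the two off-diagonal counts (the two kinds of misclassification). The sketch below computes the statistic with the continuity correction and its p-value from the chi-square distribution with one degree of freedom. The counts are invented for illustration, so the result is unrelated to the p-values reported above.

```python
from math import erfc, sqrt

def mcnemar(b, c, correct=True):
    """McNemar test from the off-diagonal counts b and c of a 2x2 table.

    Returns (statistic, p_value). With correct=True the continuity
    correction is applied whenever b and c differ.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c) - (1 if correct else 0)
    stat = max(diff, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: 5 "Low" countries predicted "High", 3 "High" predicted "Low".
stat, p_value = mcnemar(5, 3)
reject_null = p_value < 0.05
```

A large p-value here means the two kinds of misclassification occur at similar rates; it says nothing about the overall accuracy of the model.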
Conclusions
As per the above analysis, Covid deaths are negatively correlated with the mortality rate, i.e. countries with higher (lower) Covid deaths tend to have a lower (higher) mortality rate. Covid deaths are also negatively correlated with the birth rate, i.e. countries with higher (lower) Covid deaths tend to have a lower (higher) birth rate. The correlation between mortality rate and life expectancy is highly negative, i.e. a higher (lower) mortality rate is associated with a lower (higher) life expectancy. The optimal number of clusters for the countries, based on the WDI indicators, is 7. According to the accuracy scores (refer to Table 1.1), the best models for categorizing Covid deaths around the world from WDI indicators are multiclass QDA and multiclass logistic regression; the QDA algorithm in particular works very well on this data.
References
Abdullah, D., Susilo, S., Ahmar, A. S., Rusli, R., & Hidayat, R. (2021). The application of K-means clustering for province clustering in Indonesia of the risk of the COVID-19 pandemic based on COVID-19 data. Quality & Quantity, 1-9. https://doi.org/10.1007/s11135-021-01176-w
Georgiou, K., Mittas, N., Angelis, L., & Chatzigeorgiou, A. (2020, August). A preliminary study of knowledge sharing related to covid-19 pandemic in stack overflow. In 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (pp. 517-520). IEEE. https://doi.org/10.1109/SEAA51224.2020.00086
Hutagalung, J., Ginantra, N. L. W. S. R., Bhawika, G. W., Parwita, W. G. S., Wanto, A., & Panjaitan, P. D. (2021, February). COVID-19 Cases and Deaths in Southeast Asia Clustering using K-Means Algorithm. In Journal of Physics: Conference Series (Vol. 1783, No. 1, p. 012027). IOP Publishing.
Liang, J., Bi, G., & Zhan, C. (2020). Multinomial and ordinal Logistic regression analyses with multi-categorical variables using R. Annals of translational medicine, 8(16). https://doi.org/10.21037/atm-2020-57
Mahanty, C., Kumar, R., & Mishra, B. K. (2021). Analyses the effects of COVID-19 outbreak on human sexual behaviour using ordinary least-squares based multivariate logistic regression. Quality & Quantity, 55(4), 1239-1259. https://doi.org/10.1007/s11135-020-01057-8
Medina-Mendieta, J. F., Cortés-Cortés, M., & Cortés-Iglesias, M. (2020). COVID-19 forecasts for Cuba using logistic regression and gompertz curves. MEDICC review, 22(3), 32-39. https://doi.org/10.37757/MR2020.V22.N3.8
Niyakan, S., & Qian, X. (2021). COVID-Datathon: Biomarker identification for COVID-19 severity based on BALF scRNA-seq data. arXiv preprint arXiv:2110.04986. https://doi.org/10.48550/arXiv.2110.04986
Pastrana, T., De Lima, L., Pettus, K., Ramsey, A., Napier, G., Wenk, R., & Radbruch, L. (2021). The impact of COVID-19 on palliative care workers across the world: A qualitative analysis of responses to open-ended questions. Palliative & supportive care, 19(2), 187-192. https://doi.org/10.1017/S1478951521000298
Sunori, S. K., Negi, P. B., Maurya, S., Juneja, P., & Rana, A. (2021, January). K-Means Clustering of Ambient Air Quality Data of Uttarakhand, India during Lockdown Period of Covid-19 Pandemic. In 2021 6th International Conference on Inventive Computation Technologies (ICICT) (pp. 1254-1259). IEEE. https://doi.org/10.1109/ICICT50816.2021.9358627
Tena, A., Clarià, F., & Solsona, F. (2022). Automated detection of COVID-19 cough. Biomedical Signal Processing and Control, 71, 103175. https://doi.org/10.1016/j.bspc.2021.103175
Zhang, T. (2020). Data mining can play a critical role in COVID-19 linked mental health studies. Asian journal of psychiatry, 54, 102399. https://doi.org/10.1016/j.ajp.2020.102399
Zhang, Z. (2016). Residuals and regression diagnostics: focusing on logistic regression. Annals of translational medicine, 4(10). https://doi.org/10.21037/atm.2016.03.36