This report presents the findings and methods of a study of population data for Australia up to 2015, as reported by the World Bank, focusing on aspects of development and public health. Variables indicating health and development performance for Australia were chosen from the overall dataset. Taking into account both the relevance of each parameter to the focus of the study and the amount of missing data per variable, the variables concerning gross national income, unemployment and life expectancy at birth were shortlisted. In theory, literacy and the overall education status of the population ought to play a part in unemployment rates. Female literacy is likewise closely related to the extent of education, as many young mothers leave school because they become pregnant, or because they marry and have children young instead of pursuing further studies or taking up some form of employment. For developed countries, however, employment is seen to affect fertility positively. With all these factors in mind, the data for Australia was examined as discussed in the rest of this report.
The data is sourced from the database of the World Bank. The data file was provided in comma-separated value (CSV) format and imported into R Studio, where the whole analysis was carried out. The data consisted of 26 variables spanning the 15 years from 2001 to 2015 for a number of countries listed under the East Asia and Pacific region. In line with the objectives of the study, the data deemed relevant to the problem was extracted and compiled into a separate dataset. The variables used are the total fertility rate, total unemployment and the tertiary school enrolment rate for Australia. The dataset had a few missing values, which were handled by replacing each missing data point with the mean of the existing observations for that variable. This was also done in R Studio.
The R packages used for the analysis in R Studio, together with the code for importing the data and handling missing values, are:
    # Packages
    library(ggplot2)
    library(cluster)

    # Import data: choose the CSV file manually; ".." marks a missing value
    data <- read.csv(file.choose(), sep = ",", header = TRUE, na.strings = "..")
    View(data)

    # Replace each missing value with the mean of the observed values in its column
    data1 <- data
    for (i in seq_len(ncol(data1))) {
      indx <- which(is.na(data[, i]))
      if (length(indx) > 0) {
        data1[indx, i] <- mean(data[, i], na.rm = TRUE)
      }
    }
    View(data1)
Table 1: Table showing the R Codes for loading the dataset, relevant libraries and dealing with missing data
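The same mean imputation can also be sketched with a small helper applied to numeric columns only, so that any text columns are left untouched. The data frame `toy`, its column names and the helper `impute_mean` are hypothetical and are not part of the study's code or data.

```r
# Toy data frame standing in for the World Bank extract (values are made up)
toy <- data.frame(
  year  = 2011:2015,
  unemp = c(5.1, NA, 5.7, 6.1, NA),
  tfr   = c(1.92, 1.93, NA, 1.80, 1.79)
)

# Replace the NAs in one numeric vector with the mean of its observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to numeric columns only
num_cols <- vapply(toy, is.numeric, logical(1))
toy_filled <- toy
toy_filled[num_cols] <- lapply(toy[num_cols], impute_mean)

print(toy_filled)
```

Restricting the helper to numeric columns avoids the warning that `mean()` raises on character data.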
This section discusses the measures used to explore the data and the corresponding outcomes. The chosen variables, total unemployment, tertiary school enrolment and total fertility rate, were first analysed individually.
The first variable under study is total unemployment. The mean rate of unemployment over the years was found to be 6.082, with a standard deviation of 1.265151 (table 3). Moreover, the boxplot in figure 1 shows that the data is positively skewed: the mean lies above the median of 5.800 and the upper tail is longer than the lower one.
    # Exploratory Analysis
    #=================================================================#
    # 1. Total Unemployment
    summary(data1$SL.UEM.TOTL.ZS)
    sd(data1$SL.UEM.TOTL.ZS)
    var(data1$SL.UEM.TOTL.ZS)
    boxplot(data1$SL.UEM.TOTL.ZS, main = "Total Unemployment",
            xlab = "Total Unemployment", col = 4, border = 1)
Table 2: Table showing R codes for summary data and boxplot for total unemployment
    Statistic    Value
    Min.         4.200
    1st Qu.      5.200
    Median       5.800
    Mean         6.082
    3rd Qu.      6.625
    Max.         8.500
    Variance     1.600606
    St. Dev.     1.265151
Table 3: Table showing the results of summary statistics for total unemployment
The next variable under study is total tertiary school enrolment in Australia from 1997 to 2015. The code used for the necessary calculations is provided in table 4. The mean level of tertiary school enrolment was found to be 75.45, and the standard deviation was found to be 6.575815 (table 5). The histogram in figure 2 shows that the data is not symmetric.
    # 2. Total Tertiary Enrolment
    summary(data1$SE.TER.ENRR)
    sd(data1$SE.TER.ENRR)
    var(data1$SE.TER.ENRR)
    hist(data1$SE.TER.ENRR, main = "Tertiary School Enrollment",
         xlab = "Tertiary School Enrollment", col = 4, border = 1)
Table 4: Table showing R Codes for summary statistics and histogram for Total tertiary school enrolment
    Statistic    Value
    Min.         67.01
    1st Qu.      71.54
    Median       74.28
    Mean         75.45
    3rd Qu.      79.51
    Max.         90.31
    Variance     43.2413
    Std. dev.    6.575815
Table 5: Table showing the results of the summary statistics for Tertiary School Enrolment
The next variable under study is the total fertility rate in Australia from 1997 to 2015. The code used for the necessary calculations is provided in table 6. The mean level of total fertility was found to be 1.839, and the standard deviation was found to be 0.08089 (table 7). The boxplot in figure 3 shows that the data is not symmetric and is slightly skewed to the right, with the mean lying above the median of 1.827.
    # Summary data, standard deviation and variance
    # 3. Total Fertility Rate
    summary(data1$SP.DYN.TFRT.IN)
    sd(data1$SP.DYN.TFRT.IN)
    var(data1$SP.DYN.TFRT.IN)

    ## Boxplot
    boxplot(data1$SP.DYN.TFRT.IN, main = "Total Fertility Rate",
            xlab = "Total Fertility Rate", col = 4, border = 1)
Table 6: Table showing R Codes for summary data and boxplot for Total Fertility Rate
    Statistic    Value
    Min.         1.739
    1st Qu.      1.764
    Median       1.827
    Mean         1.839
    3rd Qu.      1.918
    Max.         1.984
    Variance     0.006544
    Std. dev.    0.08089
Table 7: Table showing the results of the summary statistics for Total Fertility Rate
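The direction of skew read off the plots can be cross-checked against the quartiles in tables 3, 5 and 7. The sketch below uses the quartile (Bowley) skewness coefficient, which was not part of the original analysis; positive values point to a longer upper tail and negative values to a longer lower tail. The object and function names are hypothetical.

```r
# Quartiles copied from tables 3, 5 and 7
quartiles <- data.frame(
  variable = c("unemployment", "tertiary enrolment", "fertility rate"),
  q1  = c(5.200, 71.54, 1.764),
  med = c(5.800, 74.28, 1.827),
  q3  = c(6.625, 79.51, 1.918)
)

# Bowley (quartile) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)
bowley_skew <- function(q1, med, q3) (q3 + q1 - 2 * med) / (q3 - q1)

quartiles$skew <- with(quartiles, bowley_skew(q1, med, q3))
print(quartiles)
```

Because it relies only on quartiles, this coefficient is not affected by a single extreme year.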
The analysis then turns to the relationships between pairs of variables, using graphical and statistical measures. The first pair is unemployment and tertiary school enrolment. The Pearson correlation coefficient of the two variables was computed and found to be -0.18154, which indicates a weak negative correlation between them. Figure 4 represents the relationship diagrammatically.
    # Scatterplot of unemployment against tertiary school enrolment
    plot(SL.UEM.TOTL.ZS ~ SE.TER.ENRR, data = data1,
         main = "Scatterplot of Unemployment and Tertiary School Enrolment",
         xlab = "Tertiary School Enrolment", ylab = "Unemployment",
         col = 2, pch = 19)

    # Pearson's correlation coefficient
    cor(data1$SL.UEM.TOTL.ZS, data1$SE.TER.ENRR)
Table 8: Table showing R Codes for Scatterplot 1
The second pair is total unemployment and total fertility rate. The correlation coefficient of the two variables was computed and found to be -0.6319, which indicates a moderately strong negative correlation between them. Figure 5 represents the relationship diagrammatically.
    # Scatterplot of total fertility rate (y) against unemployment (x)
    plot(SP.DYN.TFRT.IN ~ SL.UEM.TOTL.ZS, data = data1,
         main = "Total Fertility Rate and Unemployment",
         xlab = "Unemployment", ylab = "Total Fertility Rate",
         col = 2, pch = 19)

    # Correlation coefficient
    cor(data1$SP.DYN.TFRT.IN, data1$SL.UEM.TOTL.ZS)
Table 9: Table showing R Codes for Scatterplot 2
From results of the exploratory analysis, it was observed and deduced that the correlation between Total Unemployment and Total fertility rate is greater in magnitude than the correlation between Total Unemployment and Total rate of tertiary school enrolment.
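To put the two coefficients on a common scale, each can be squared to give the share of variance that one variable shares linearly with the other, which is also the R-squared of the corresponding simple linear regression. The sketch below does this for the two reported values and then illustrates the identity on simulated data; the simulated `x` and `y` are assumptions and not the study data.

```r
# Correlation coefficients reported in the exploratory analysis
r_unemp_enrol <- -0.18154
r_fert_unemp  <- -0.6319

# Share of variance that each pair of variables shares linearly
r_unemp_enrol^2
r_fert_unemp^2

# Illustration on simulated data (not the study data): in a simple linear
# regression the squared correlation equals the model's R-squared
set.seed(1)
x <- rnorm(22)
y <- 2 - 0.5 * x + rnorm(22)
fit <- lm(y ~ x)
all.equal(cor(x, y)^2, summary(fit)$r.squared)
```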
It is then of interest to see how fertility rates have fared in the country over the years. To do so, clustering is used to see whether years with similar fertility rates can be identified and, on that basis, whether any other factor that may influence fertility can be found. The k-means clustering method was employed to group the observed yearly fertility rates in the data frame into clusters. Each cluster contains the years whose fertility rates lie close to the mean fertility rate of that cluster (Celebi, Kingravi and Vela 2013). This means that within-cluster variation is low whereas between-cluster variation is high, and that is the criterion for choosing the number of clusters that best represents the data. A number of methods and algorithms can do this; the one used in this case is the k-means algorithm (Guha and Mishra 2016).
The R code for the algorithm and for plotting the clusters is given in table 10. For k-means it is necessary to assume a number of clusters in advance and then check empirically how the proportion of within-cluster variation to total variation changes as that number grows. Within-cluster variation measures how homogeneous the clusters are. Because it keeps falling as more clusters are added, the number of clusters is chosen at the point beyond which further clusters bring only a small reduction. As shown in figure 6, this point is reached at 2 clusters. Figure 7 shows the clusters diagrammatically.
    # For k-means clustering, choose the optimum number of clusters
    wss <- rep(0, 10)
    for (i in 1:10) {
      grpdatai <- kmeans(data1[, c("SP.DYN.TFRT.IN")], centers = i, nstart = 10)
      wss[i] <- grpdatai$tot.withinss  # total within-cluster sum of squares for i clusters
    }

    ## Check for the optimum number of clusters
    plot(1:10, wss, type = "l", xlab = "number of clusters", ylab = "within sum of squares")
    abline(v = 2, col = "red")

    ## Cluster in two groups
    grpdata <- kmeans(data1[, c("SP.DYN.TFRT.IN")], centers = 2, nstart = 10)
    grpdata
    # List the first column of the data with its cluster, ordered by cluster
    o <- order(grpdata$cluster)
    data.frame(data1[o, 1], grpdata$cluster[o])

    # Plot the clustered data, labelling each point and colouring it by cluster
    plot(data1[, 1], data1$SP.DYN.TFRT.IN, type = "n",
         xlab = "clusters", ylab = "Total Fertility Rate")
    text(x = data1[, 1], y = data1$SP.DYN.TFRT.IN,
         labels = data1[, 1], col = grpdata$cluster + 1)
Table 10: Table Showing R Codes for Clustering
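The way the number of clusters is read off the within sum of squares can be sketched on simulated data. The values below are illustrative only: 22 made-up fertility rates drawn from two regimes, not the study data, and the object names are hypothetical.

```r
set.seed(42)
# Simulated fertility rates for 22 years in two regimes (illustrative values only)
tfr_sim <- c(rnorm(11, mean = 1.76, sd = 0.01),
             rnorm(11, mean = 1.93, sd = 0.02))

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss_sim <- sapply(k_values, function(k) {
  kmeans(tfr_sim, centers = k, nstart = 10)$tot.withinss
})

# Reduction in within-cluster sum of squares gained by each extra cluster
drops <- -diff(wss_sim)
print(data.frame(k = k_values[-1], drop_in_wss = drops))

# Cluster sizes for the two-cluster solution
fit2 <- kmeans(tfr_sim, centers = 2, nstart = 10)
print(table(fit2$cluster))
```

In this simulated case the reduction from one to two clusters dwarfs every later reduction, which is the pattern that an elbow at 2 clusters describes.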
Next, the relationship between the variables is analysed by considering the linear relationship that may exist between them. A linear regression model is therefore fitted to explain the variation in one variable on the basis of another. The linear regression setup treats one variable as the dependent variable, which is to be explained by an independent predictor variable (Kabacoff 2015). The mathematical form of the linear regression model is expressed as:
y = β0 + β1x + ε,
Here β0 is the intercept, which denotes the value of the dependent variable y when the predictor x is 0; β1 is the slope parameter of the predictor x and denotes the change in y for a unit change in x; and ε is the error term, that is, the difference between the actual value of y and the value predicted by the model (Montgomery, Peck and Vining 2015).
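For a single predictor, the least-squares estimates of the two parameters have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below checks this against `lm()` on simulated data; the data and the names `b0`, `b1` and `res` are assumptions made for illustration.

```r
set.seed(7)
# Simulated predictor and response (illustrative values only)
x <- runif(22, min = 4, max = 9)
y <- 2 - 0.04 * x + rnorm(22, sd = 0.05)

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# The same estimates from lm()
fit <- lm(y ~ x)
print(rbind(closed_form = c(b0, b1), lm = unname(coef(fit))))

# Residuals: differences between observed and fitted y
res <- y - (b0 + b1 * x)
```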
Regression 1:
The first regression model considers total unemployment and total tertiary enrolment, with unemployment as the dependent variable and total tertiary enrolment as the independent variable. The regression equation obtained is:
Total Unemployment = 8.71728 – 0.03493* total tertiary enrolment + Error
This model with total tertiary enrolment explains only 3.3% (R-squared of 0.03296) of the variation in Total Unemployment in the data, and the slope is not statistically significant (p = 0.4188).
    # Regression 1
    Reg1 <- lm(formula = SL.UEM.TOTL.ZS ~ SE.TER.ENRR, data = data1)
    summary(Reg1)

    # Draw the regression line
    plot1 <- ggplot(data1, aes(x = SE.TER.ENRR, y = SL.UEM.TOTL.ZS)) +
      geom_point(shape = 1) +
      scale_x_continuous(name = "Total Tertiary School Enrolment") +
      scale_y_continuous(name = "Unemployment") +
      geom_smooth(method = "lm") +
      theme_bw() +
      ggtitle("Regression of Unemployment on Total Tertiary Enrolment")
    plot1
Table 11: Table showing R Codes for Regression 1
    Residuals:
        Min      1Q  Median      3Q     Max
    -1.9704 -0.6991 -0.1648  0.4819  2.4919

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.71728    3.20372   2.721   0.0132 *
    SE.TER.ENRR -0.03493    0.04231  -0.826   0.4188
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.275 on 20 degrees of freedom
    Multiple R-squared:  0.03296,  Adjusted R-squared:  -0.01539
    F-statistic: 0.6816 on 1 and 20 DF,  p-value: 0.4188
Table 12: Table showing results for Regression Analysis of Regression 1
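The t value and p-value in table 12 follow directly from the slope estimate and its standard error. The short sketch below recomputes them from the rounded figures in the table, so small differences in the last digit are to be expected; the object names are hypothetical.

```r
# Slope estimate, standard error and residual degrees of freedom from table 12
estimate  <- -0.03493
std_error <- 0.04231
df_resid  <- 20

# t statistic and two-sided p-value from the t distribution
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)
print(c(t_value = t_value, p_value = p_value))
```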
Regression 2:
The second regression model considers total unemployment and total fertility rate, with total fertility rate as the dependent variable and unemployment as the independent variable. The regression equation obtained is:
Total Fertility Rate = 2.084 – 0.04 * Unemployment + Error
This model with Unemployment explains about 40% (R-squared of 0.3993) of the variation in the Total Fertility Rate in the data, and the slope is statistically significant (p = 0.0016).
    # Regression 2
    Reg2 <- lm(formula = SP.DYN.TFRT.IN ~ SL.UEM.TOTL.ZS, data = data1)
    summary(Reg2)

    # Draw the regression line
    plot2 <- ggplot(data1, aes(x = SL.UEM.TOTL.ZS, y = SP.DYN.TFRT.IN)) +
      geom_point(shape = 1) +
      scale_x_continuous(name = "Total Unemployment") +
      scale_y_continuous(name = "Total Fertility Rate") +
      geom_smooth(method = "lm") +
      theme_bw() +
      ggtitle("Regression of Total Fertility Rate on Unemployment")
    plot2
Table 13: Table showing R Codes for Regression 2
    Residuals:
          Min        1Q    Median        3Q       Max
    -0.098122 -0.064914  0.000082  0.051542  0.112960

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)     2.08432    0.06877  30.307  < 2e-16 ***
    SL.UEM.TOTL.ZS -0.04041    0.01108  -3.646  0.00161 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.06425 on 20 degrees of freedom
    Multiple R-squared:  0.3993,  Adjusted R-squared:  0.3693
    F-statistic: 13.3 on 1 and 20 DF,  p-value: 0.001606
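The fitted equation can be read as a rule for converting an unemployment rate into an expected fertility rate. The sketch below applies the reported coefficients to a few unemployment rates; the helper `predict_tfr` is hypothetical, and the rates 5, 6 and 7 are chosen only for illustration.

```r
# Coefficients copied from the Regression 2 output
intercept <- 2.08432
slope     <- -0.04041

# Hypothetical helper: fitted fertility rate for a given unemployment rate (%)
predict_tfr <- function(unemployment) intercept + slope * unemployment

print(predict_tfr(c(5, 6, 7)))

# At the mean unemployment rate (6.082, table 3) the fitted line returns a
# value close to the mean fertility rate (1.839, table 7)
print(predict_tfr(6.082))
```

The last line reflects a general property of least squares: the fitted line passes through the point formed by the two sample means.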
Conclusion
This analysis showed that there is a moderate relationship between unemployment and fertility rates in Australia. The relationship was negative, that is, increasing unemployment was seen to correspond with a decrease in fertility. It could therefore be speculated that the years in the cluster with high mean fertility were also years of low unemployment, which is itself related to economic prosperity. Tertiary enrolment, however, was not found to have as much of a relationship with unemployment.
A key problem in the research was the lack of data points for some years for many variables, so the study had to exclude many variables. Consequently, only a subset of the data was used, and the remaining missing values were replaced by the mean of the available observations. This could have inflated or deflated the estimates and could affect the results of the analysis.
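One known effect of mean imputation is that it leaves the mean of a variable unchanged but shrinks its variance, because every imputed point sits exactly on the mean. The sketch below shows the size of that effect on a toy series; the values and object names are made up for illustration and are not the study data.

```r
# Toy series with two missing values (illustrative values only)
x <- c(5.1, 5.4, NA, 6.0, 6.3, NA, 7.2, 8.5)

# Mean imputation: fill each NA with the mean of the observed values
x_imputed <- x
x_imputed[is.na(x)] <- mean(x, na.rm = TRUE)

var_observed <- var(x, na.rm = TRUE)   # variance of the observed values only
var_imputed  <- var(x_imputed)         # variance after imputation

# The variance shrinks by the factor (m - 1) / (n - 1), where m is the
# number of observed values and n is the total number of values
m <- sum(!is.na(x))
n <- length(x)
print(c(observed = var_observed, imputed = var_imputed, ratio = (m - 1) / (n - 1)))
```

The more values are imputed, the smaller this factor becomes, so standard deviations and correlations computed afterwards should be read with that in mind.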
References
Celebi, M.E., Kingravi, H.A. and Vela, P.A., 2013. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1), pp.200-210.
Guha, S. and Mishra, N., 2016. Clustering data streams. In Data Stream Management (pp. 169-187). Springer Berlin Heidelberg.
Kabacoff, R., 2015. R in action: data analysis and graphics with R. Manning Publications Co.
Montgomery, D.C., Peck, E.A. and Vining, G.G., 2015. Introduction to linear regression analysis. John Wiley & Sons.