Population Data Analysis Of Australia In 2015: Exploring Health And Development Parameters

Data Setup

This report presents the findings and methods of analysis of a study on population data of Australia in 2015 as reported by the WHO, focusing on aspects of development and public health. Variables indicating health and development performance for Australia were chosen from the overall dataset. Taking into account both the relevance of each variable to the focus of the study and the amount of missing data, the attributes concerning gross national income, unemployment and life expectancy at birth were narrowed down. Theoretically speaking, the literacy and overall education status of the population ought to play a part in unemployment rates. Female literacy is also closely related to the extent of education, as many young women leave school because they become pregnant, or marry and have children young, instead of pursuing further studies or taking up some form of employment. In developed countries, however, employment is seen to affect fertility positively. Keeping all these factors in mind, the data for Australia were scrutinized and the study discussed below was carried out.

The data are sourced from the World Bank database. The data file was provided in comma-separated value (CSV) format and imported into RStudio, where the analysis was carried out. The data consisted of 26 variable attributes spanning 15 years, from 2001 to 2015, for a number of countries listed under the East Asia and Pacific region. Keeping in mind the objectives of the study, the data deemed relevant to the problem were extracted and compiled into a separate dataset. The attributes used are the total fertility rate, total unemployment and tertiary school enrolment rates for Australia. The dataset had a few missing values, which were dealt with by replacing each missing data point with the average of the existing observations for that variable. This was done in RStudio itself.

The R packages used for carrying out the analysis in RStudio are:

ggplot2 : library for plotting data points
cluster : library for cluster analysis (the kmeans() function used in this study is part of R's built-in stats package)

# Packages

library(ggplot2)

library(cluster)

# Import data: choose the file manually

data <- read.csv(file.choose(), sep = ",", header = TRUE, na.strings = "..")

View(data)

# Replace missing values in each column with the mean of the observed values

data1 <- data

for (i in seq_along(data1)) {

indx <- which(is.na(data[, i]))

if (length(indx) > 0) {

data1[indx, i] <- mean(data[, i], na.rm = TRUE)

}

}

View(data1)

Table 1: Table showing the R Codes for loading the dataset, relevant libraries and dealing with missing data
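The imputation step can be illustrated on a small scale. The following is a minimal sketch, assuming a made-up data frame (`toy`, its columns and the helper `impute_mean` are hypothetical and not part of the study data); it shows that mean imputation fills the gaps while shrinking the spread of the variable.

```r
# Toy data frame (hypothetical values, not the study data)
toy <- data.frame(
  year = 2011:2015,
  unemployment = c(5.1, NA, 5.7, 6.1, NA),
  fertility = c(1.92, 1.93, NA, 1.80, 1.79)
)

# Replace NAs in every numeric column with that column's observed mean
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy_filled <- impute_mean(toy)
print(toy_filled)

# The column mean is unchanged, but the standard deviation becomes smaller
sd_before <- sd(toy$unemployment, na.rm = TRUE)
sd_after <- sd(toy_filled$unemployment)
c(before = sd_before, after = sd_after)
```

Because every filled value sits exactly at the mean, the variance and any correlation computed afterwards tend to be understated, which is worth keeping in mind when reading the results.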

This section discusses the measures used to explore the data and the corresponding outcome. The chosen variables were analysed individually. Considering the variables Total Unemployment, Tertiary school enrolment and total fertility rate the analysis was carried forward.

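The analysis shows that the standard deviation of total unemployment is 1.265151 (table 3). The mean rate of unemployment over the years was found to be 6.082. Moreover, the summary statistics and the boxplot in figure 1 indicate that the data are skewed: the mean exceeds the median (5.800) and the maximum lies further from the median than the minimum does, which points to positive rather than negative skew.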

#Exploratory Analysis

#=================================================================#

#1.Total Unemployment

summary(data1$SL.UEM.TOTL.ZS)

sd(data1$SL.UEM.TOTL.ZS)

var(data1$SL.UEM.TOTL.ZS)

boxplot(data1$SL.UEM.TOTL.ZS, main = "Total Unemployment", xlab="Total Unemployment", col=4, border=1)

Table 2: Table showing R codes for summary data and boxplot for total unemployment

Min.      4.200
1st Qu.   5.200
Median    5.800
Mean      6.082
3rd Qu.   6.625
Max.      8.500
Variance  1.600606
St. Dev.  1.265151

Table 3: Table showing the results of summary statistics for total unemployment
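As a rough check on the direction of skew, two simple indicators can be computed from the reported summary values alone. This is only a sketch: it uses the quartiles, median and mean from the table above, and the quartile-based (Bowley) coefficient is one possible measure chosen here for illustration, not one used in the study.

```r
# Summary values as reported for total unemployment
q1 <- 5.200
med <- 5.800
avg <- 6.082
q3 <- 6.625

# Mean minus median: a positive value hints at a longer upper tail
mean_minus_median <- avg - med
mean_minus_median

# Bowley (quartile) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)
bowley
```

Both indicators come out positive for these values, which points to a longer upper tail.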

Next, the variable under study is the total tertiary school enrolment in Australia from 1997 to 2015. The code used for the necessary calculations is provided in table 4. The mean level of tertiary school enrolment was found to be 75.45 and the standard deviation 6.575815. The histogram in figure 2 shows that the data are not symmetric.

#2.Total Tertiary Enrolment

summary(data1$SE.TER.ENRR)

sd(data1$SE.TER.ENRR)

var(data1$SE.TER.ENRR)

hist(data1$SE.TER.ENRR, main = "Tertiary School Enrollment", xlab="Tertiary School Enrollment", col=4, border=1)


Table 4: Table showing R Codes for summary statistics and histogram for Total tertiary school enrolment

Min.      67.01
1st Qu.   71.54
Median    74.28
Mean      75.45
3rd Qu.   79.51
Max.      90.31
Variance  43.2413
Std.dev.  6.575815

Table 5: Table showing the results of the summary statistics for Tertiary School Enrolment

Next, the variable under study is the total fertility rate in Australia from 1997 to 2015. The code used for the necessary calculations is provided in table 6. The mean total fertility rate was found to be 1.839 and the standard deviation 0.08089. The boxplot in figure 3 shows that the data are not symmetric and are slightly skewed towards the left tail.

# for summary data, standard deviation and variance

#3.Total Fertility Rate

summary(data1$SP.DYN.TFRT.IN)

sd(data1$SP.DYN.TFRT.IN)

var(data1$SP.DYN.TFRT.IN)

##Boxplot

boxplot(data1$SP.DYN.TFRT.IN, main = "Total Fertility Rate", xlab="Total Fertility Rate", col=4, border=1)

Table 6: Table showing R Codes for summary data and boxplot for Total Fertility Rate

Min.      1.739
1st Qu.   1.764
Median    1.827
Mean      1.839
3rd Qu.   1.918
Max.      1.984
Variance  0.006544
Std.dev.  0.08089

Table 7: Table showing the results of the summary statistics for Total Fertility Rate

The analysis then focuses on the relationships between the individual variables through graphical and statistical measures. The following figure shows the relationship between unemployment and tertiary school enrolment. The correlation coefficient of the two variables was computed and found to be -0.18154, so there is a weak negative correlation between them. Figure 4 represents the relationship diagrammatically.

# for scatterplot

plot(SL.UEM.TOTL.ZS~SE.TER.ENRR, data=data1, main="Scatterplot of Unemployment and Tertiary School Enrolment", xlab="Tertiary School Enrolment", ylab="Unemployment", col=2, pch=19)

# to get the Pearson's correlation coefficient

cor(data1$SL.UEM.TOTL.ZS,data1$SE.TER.ENRR)

Table 8: Table showing R Codes for Scatterplot 1

The following figure shows the relationship between the two variables, viz., total unemployment and total fertility rate. The correlation coefficient of the variables was also computed and it was found to be -0.6319. Thus it is seen that there is a moderate to high negative correlation in the variables. Figure 5 represents the relationship diagrammatically.

# to make scatterplot

plot(SP.DYN.TFRT.IN~SL.UEM.TOTL.ZS, data=data1, main="Total Fertility Rate and Unemployment", xlab="Unemployment", ylab="Total Fertility Rate", col=2, pch=19)

# to get the correlation coefficient

cor(data1$SP.DYN.TFRT.IN,data1$SL.UEM.TOTL.ZS)

Table 9: Table showing R Codes for Scatterplot 2

From the results of the exploratory analysis, it was observed that the correlation between total unemployment and total fertility rate is greater in magnitude than the correlation between total unemployment and tertiary school enrolment.
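To give a feel for what these magnitudes mean, squaring a Pearson correlation gives the share of variance in one variable that a straight-line relationship with the other accounts for. The sketch below squares the two coefficients quoted above and then checks, on made-up data (the vectors `x` and `y` are hypothetical), that this equals the R-squared of a one-predictor linear model.

```r
# Correlations as reported in the text
r_enrol <- -0.18154   # unemployment vs tertiary school enrolment
r_fert <- -0.6319     # unemployment vs total fertility rate

# Squared correlations: share of variance explained by a linear relationship
r2_enrol <- r_enrol^2
r2_fert <- r_fert^2
round(c(enrolment = r2_enrol, fertility = r2_fert), 4)

# Toy check (hypothetical data) that cor()^2 matches R-squared from lm()
set.seed(1)
x <- rnorm(20)
y <- 2 - 0.5 * x + rnorm(20)
same <- all.equal(cor(x, y)^2, summary(lm(y ~ x))$r.squared)
same
```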

It is then of interest to see how the fertility rate has fared in the country over the years. To do so, clustering is used to see whether years with similar fertility rates can be identified and, on that basis, whether any other factor that may influence fertility can be found. The k-means clustering method was employed: the observed yearly fertility rates in the data frame are grouped into clusters, each containing the years whose fertility rates lie close to the mean fertility rate of that cluster (Celebi, Kingravi and Vela 2013). This means that within-cluster variation is low whereas between-cluster variation is high, and that is the criterion for choosing the best number of clusters to represent the data. A number of methods and algorithms can do this; the one used in this case is the k-means algorithm (Guha and Mishra 2016).

The R code for the algorithm and for plotting the clusters is given in table 10. For k-means it is necessary to assume a certain number of clusters initially and then check empirically how the proportion of within-cluster variation to total variation changes with the number of clusters. The within-cluster variation is a measure of the degree of homogeneity within the clusters. Since it keeps falling as more clusters are added, the number of clusters is chosen at the point beyond which adding another cluster reduces it only slightly (the "elbow" of the curve). As shown by figure 6, the optimum number of clusters by this criterion is 2. Figure 7 shows the clusters diagrammatically.

# For k-means clustering choose optimum number of clusters

wss<-rep(0,10)

for(i in 1:10){

# fit k-means with i clusters and store the total within-cluster sum of squares

grpdatai <- kmeans(data1[,c("SP.DYN.TFRT.IN")],centers = i, nstart = 10)

wss[i]<-grpdatai$tot.withinss

}

##Check for optimum number of clusters

plot(1:10,wss,type="l",xlab="number of clusters",ylab="within sum of squares")

abline(v=2,col="red")

##Cluster in two groups

grpdata <- kmeans(data1[,c("SP.DYN.TFRT.IN")],centers = 2, nstart = 10)

grpdata

# list the first column next to its cluster label, ordered by cluster

o = order(grpdata$cluster)

data.frame(data1[o,1], grpdata$cluster[o])

# for plotting the clustered data

plot(data1[,1], data1$SP.DYN.TFRT.IN,type="n", xlab="clusters",ylab="Total Fertility Rate")

text(x=data1[,1], y= data1$SP.DYN.TFRT.IN, labels=data1[,1],col=grpdata$cluster+1)


Table 10: Table Showing R Codes for Clustering
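The choice of the number of clusters can be tried out on a small scale. The following is a minimal sketch, assuming a made-up one-dimensional variable (`rate` is hypothetical and only loosely mimics two groups of fertility rates); it computes the within-cluster sum of squares for several values of k and expresses it as a share of the total variation.

```r
set.seed(42)

# Hypothetical one-dimensional data with two well-separated groups
rate <- c(rnorm(10, mean = 1.76, sd = 0.01),
          rnorm(10, mean = 1.95, sd = 0.01))

# Total within-cluster sum of squares for k = 1..5
wss <- sapply(1:5, function(k) {
  kmeans(rate, centers = k, nstart = 10)$tot.withinss
})

# Share of total variation left inside the clusters (k = 1 gives 1 by definition)
round(wss / wss[1], 3)

# Fit the two-cluster solution and count the members of each cluster
fit <- kmeans(rate, centers = 2, nstart = 10)
table(fit$cluster)
```

In data like these the share drops sharply from k = 1 to k = 2 and only marginally afterwards, which is the pattern the elbow criterion looks for.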

Next, the relationship between the variables is analysed by considering the linear relationship that may exist between them. A linear regression model is therefore fitted to explain the variation in one variable on the basis of another. The linear regression setup treats one variable as the dependent variable, which is to be explained by an independent predictor variable (Kabacoff 2015). The mathematical form of the linear regression model is expressed as:

y = β₀ + β₁x + ε,

Here β₀ is the intercept, which denotes the value of the dependent variable y when x is 0; β₁ is the slope parameter of the predictor x and denotes the change in y for a unit change in x; and ε is the error term, that is, the difference between the actual and the predicted value of y (Montgomery, Peck and Vining 2015).
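For a single predictor, the least-squares estimates of the two parameters have a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below, on made-up data (`x` and `y` are hypothetical), compares these formulas with the output of `lm()`.

```r
set.seed(7)

# Hypothetical predictor and response
x <- runif(25, min = 4, max = 9)
y <- 2 - 0.04 * x + rnorm(25, sd = 0.05)

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)          # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# Same model fitted with lm()
fit <- lm(y ~ x)
rbind(closed_form = c(b0, b1), lm = coef(fit))

# Residuals of the fitted line; they sum to zero up to rounding error
eps <- y - (b0 + b1 * x)
sum(eps)
```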

Regression 1:

The first regression model considers total unemployment and total tertiary enrolment, with unemployment as the dependent variable and total tertiary enrolment as the independent variable. The regression equation obtained is:

Total Unemployment = 8.71728 – 0.03493* total tertiary enrolment + Error

This model, with total tertiary enrolment as the predictor, explains only 3.3% (R-squared value) of the variation in total unemployment in the data.

# Regression 1

Reg1 <- lm(formula = SL.UEM.TOTL.ZS~SE.TER.ENRR, data = data1)

summary(Reg1)

# to draw the regression line

plot1 <- ggplot(data1, aes(x=SE.TER.ENRR, y=SL.UEM.TOTL.ZS)) + geom_point(shape=1) + scale_x_continuous(name = "Total Tertiary School Enrolment") + scale_y_continuous(name = "Unemployment")+ geom_smooth(method=lm) +theme_bw()+ ggtitle("Regression of Unemployment on Total Tertiary Enrolment")

plot1

Table 11: Table showing R Codes for Regression 1

Residuals:
    Min      1Q  Median      3Q     Max
-1.9704 -0.6991 -0.1648  0.4819  2.4919

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.71728    3.20372   2.721   0.0132 *
SE.TER.ENRR -0.03493    0.04231  -0.826   0.4188
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.275 on 20 degrees of freedom
Multiple R-squared: 0.03296, Adjusted R-squared: -0.01539
F-statistic: 0.6816 on 1 and 20 DF, p-value: 0.4188

Table 12: Table showing results for Regression Analysis of Regression 1

Regression 2:

The second regression model considers total unemployment and total fertility rate, with total fertility rate as the dependent variable and unemployment as the independent variable. The regression equation obtained is:

Total Fertility Rate = 2.084 - 0.04 * Unemployment + Error

This model, with unemployment as the predictor, explains about 40% (R-squared value) of the variation in total fertility rate in the data.

# Regression 2

Reg2 <- lm(formula = SP.DYN.TFRT.IN~SL.UEM.TOTL.ZS, data = data1)

summary(Reg2)

# To draw the regression line

plot2 <- ggplot(data1, aes(x=SL.UEM.TOTL.ZS, y=SP.DYN.TFRT.IN)) + geom_point(shape=1) + scale_x_continuous(name = "Total Unemployment") + scale_y_continuous(name = "Total Fertility Rate")+ geom_smooth(method=lm) +theme_bw()+ ggtitle("Regression of Total Fertility Rate on Unemployment")

plot2

Table 13: Table showing R Codes for Regression 2

Residuals:
      Min        1Q    Median        3Q       Max
-0.098122 -0.064914  0.000082  0.051542  0.112960

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     2.08432    0.06877  30.307  < 2e-16 ***
SL.UEM.TOTL.ZS -0.04041    0.01108  -3.646  0.00161 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06425 on 20 degrees of freedom
Multiple R-squared: 0.3993, Adjusted R-squared: 0.3693
F-statistic: 13.3 on 1 and 20 DF, p-value: 0.001606
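The reported figures can be cross-checked against one another using two properties of simple least-squares regression: the fitted line passes through the point of means, and the slope equals the correlation times the ratio of the standard deviations. The sketch below uses only values quoted in this report; the standard deviation of the fertility rate is taken as 0.08089 (the square root of 0.006544), and all variable names are illustrative.

```r
# Regression 2: fertility on unemployment (reported coefficients)
b0_reg2 <- 2.08432
b1_reg2 <- -0.04041
mean_unemp <- 6.082
mean_fert <- 1.839

# Prediction at the mean unemployment rate should be close to mean fertility
pred_fert <- b0_reg2 + b1_reg2 * mean_unemp
c(predicted = pred_fert, reported_mean = mean_fert)

# Regression 1: unemployment on tertiary enrolment (reported coefficients)
b0_reg1 <- 8.71728
b1_reg1 <- -0.03493
mean_enrol <- 75.45
pred_unemp <- b0_reg1 + b1_reg1 * mean_enrol
c(predicted = pred_unemp, reported_mean = mean_unemp)

# Slope of Regression 2 from the correlation and standard deviations
r_fert <- -0.6319
sd_unemp <- 1.265151
sd_fert <- 0.08089
slope_check <- r_fert * sd_fert / sd_unemp
c(from_correlation = slope_check, reported_slope = b1_reg2)
```

Small differences in the last decimal places are to be expected, since the inputs are rounded.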

Conclusion

This analysis showed that there is a moderate relationship between unemployment and fertility rates in Australia. The relationship was negative, that is, increasing unemployment corresponded to a decrease in fertility. It could then be speculated that the years in the cluster with high mean fertility were also years of low unemployment, which in turn is related to economic prosperity. Tertiary enrolment, however, was not found to have as much effect on unemployment.

A key problem in the research was the lack of data points in some years for many variables, so the study had to exclude many variables. Consequently, only a subset of the data was used, and the missing values were replaced by the mean of the available observations. This could have inflated or deflated the data and may affect the results of the analysis.

References

Celebi, M.E., Kingravi, H.A. and Vela, P.A., 2013. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1), pp.200-210.

Guha, S. and Mishra, N., 2016. Clustering data streams. In Data Stream Management (pp. 169-187). Springer Berlin Heidelberg.

Kabacoff, R., 2015. R in action: data analysis and graphics with R. Manning Publications Co.

Montgomery, D.C., Peck, E.A. and Vining, G.G., 2015. Introduction to linear regression analysis. John Wiley & Sons.
