Statistics And Data Analysis Concepts

Population and Sample

This paper discusses the concepts of statistics and data analysis, covering statistical terminology, the tools used to analyze data, and general statistics. Statistics is defined as a methodology that mathematicians and statisticians use for collecting, analyzing, and interpreting data and for making inferences from a sample of data (Aberson, 2010). From this definition, it is clear that statistics is more than the tabulation of numbers and the graphical presentation of information. In detail, statistical methods are used to determine:

What kind of data, and how much of it, needs to be collected
How the data should be organized and summarized
Which analysis should be carried out, how, and which conclusions can be drawn
How to assess the strength of the conclusions and evaluate their uncertainty.

In summary, statistics provides the methodology for:

Design: planning how to carry out research studies
Description: summarizing and exploring data
Inference: making predictions and generalizations about a phenomenon represented by data

Population and sample are basic concepts in statistics. A population is the set of all individuals, subjects, or objects that an investigator is interested in during a study. A sample is the subset of individuals from the population that is actually involved in the study (Agarwal, n.d.).

Descriptive and inferential statistics are the two major branches of statistics. Descriptive statistics is devoted to summarizing and describing data, while inferential statistics is concerned with making inferences about a population (Fraser, 2012). In general, descriptive statistics consists of methods for organizing and summarizing information, while inferential statistics consists of methods for drawing conclusions about the population under study and measuring the reliability of those conclusions (Brase, 2013). Descriptive statistics includes measures of central tendency (mean, median, and mode) and measures of dispersion (range, minimum and maximum values, variance, and standard deviation). It also includes the construction of tables, charts, and graphs. Inferential statistics consists of methods such as point estimation, interval estimation, and hypothesis testing, all of which are based on probability theory (Friedman, 2010).
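As an illustration of interval estimation, the sketch below computes a normal-approximation 95% confidence interval for a population mean from summary statistics. It uses the mean, standard deviation, and sample size reported for age in Table 2 of this paper; the function name and the choice of the normal approximation are assumptions made for this example and are not part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    # Two-sided critical value, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / sqrt(n)  # z times the standard error of the mean
    return mean - half_width, mean + half_width


# Summary values for age as reported in Table 2 (n = 100).
low, high = mean_confidence_interval(mean=54.76, sd=8.316, n=100)
print(f"95% CI for mean age: ({low:.2f}, {high:.2f})")
```

The sample mean is the point estimate; the interval expresses how far the unknown population mean could plausibly lie from it.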

Features of the population under investigation are summarized as numerical parameters, so the research problem becomes an investigation of the values of those parameters (Givens, 2013). Population parameters are usually unknown, and sample statistics are used to make inferences about them (Daniel, 2010).
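The distinction between a parameter and a statistic can be sketched with a small simulation. The population below is entirely hypothetical (10,000 simulated ages, not the study data), and the seed and sizes are arbitrary choices for illustration.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 ages (illustrative, not the study data).
population = [random.gauss(55, 8) for _ in range(10_000)]
parameter = mean(population)  # population mean: unknown in a real study

# Draw a simple random sample of 100 subjects without replacement.
sample = random.sample(population, 100)
statistic = mean(sample)  # sample mean: the quantity we can actually compute

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

In practice only the sample is observed, so the statistic stands in for the parameter.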

The main objective of statistics is to understand what the data contains. Any data analysis therefore begins with identifying the variables in the data and their types.

A variable is any measurable characteristic that varies among the individual members of a population. The main types of variables in statistics are quantitative and qualitative. Quantitative variables include height, weight, length, and width, and may be classified as continuous or discrete. Qualitative variables include eye color, marital status, sex, and hair color, and may be classified as either nominal or ordinal (Field, 2014).

The data used in this paper come from an experimental study intended to investigate the relationship between age, gender, type of chest pain, fasting blood sugar, and the class of the subject, whether sick or healthy (Knopov, 2012). The data were obtained from a web resource: https://mercury.webster.edu/aleshunas/Data%20Sets/Supplemental%20Excel%20Data%20Sets.htm

The dataset comprises 100 subjects with the following variables: age, gender, chest pain type, blood pressure, whether the fasting blood sugar is less than 120, and the class of the patient. Age and blood pressure are quantitative variables, while gender, chest pain type, the fasting blood sugar indicator, and the class of the subject are qualitative variables.
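One way to make the variable types explicit is to describe a subject as a typed record. The sketch below is only illustrative: the field names, the example values, and the `QUANTITATIVE` set are hypothetical and do not come from the dataset's actual column headers.

```python
from dataclasses import dataclass, fields


@dataclass
class Subject:
    # Field names are hypothetical labels for the six variables described above.
    age: int                             # quantitative
    sex: str                             # qualitative (nominal)
    chest_pain_type: str                 # qualitative
    blood_pressure: int                  # quantitative
    fasting_blood_sugar_below_120: bool  # qualitative (two categories)
    patient_class: str                   # qualitative: sick or healthy


QUANTITATIVE = {"age", "blood_pressure"}

# Classify every field of the record as quantitative or qualitative.
kinds = {
    f.name: "quantitative" if f.name in QUANTITATIVE else "qualitative"
    for f in fields(Subject)
}

# A made-up subject, for illustration only.
example = Subject(54, "Male", "type 1", 130, False, "healthy")
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```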

Descriptive and Inferential Statistics

Table 1

	age	blood pressure
Valid	100	100
Missing	0	0

Table 1 above shows the sample size of the study: there were 100 subjects, with no missing values for age or blood pressure.

Table 2

Descriptive Statistics
	N	Range	Minimum	Maximum	Mean	Std. Deviation	Variance	Kurtosis	Kurtosis Std. Error
age	100	34	37	71	54.76	8.316	69.154	-.882	.478
blood pressure	100	76	104	180	132.37	15.048	226.437	.098	.478
Valid N (listwise)	100

Table 2 above presents the descriptive statistics of the quantitative variables age and blood pressure. The youngest subject was 37 years old while the oldest was 71 years old. The mean age was 54.76 years, which is approximately 55 years, with a standard deviation of 8.316. The highest blood pressure recorded was 180 while the lowest was 104. The mean blood pressure was 132.37, with a standard deviation of 15.048.
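Two of the columns in Table 2 can be checked by hand. The sketch below assumes the usual large-sample formula for the standard error of sample excess kurtosis, which depends only on the sample size, and it checks that each variance is the square of the corresponding standard deviation. The function name is an assumption for this example.

```python
from math import sqrt


def kurtosis_standard_error(n: int) -> float:
    """Standard error of sample excess kurtosis for a sample of size n."""
    # Standard error of skewness, which the kurtosis formula builds on.
    ses = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return 2 * ses * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# With n = 100 this agrees with the .478 shown for both variables in Table 2.
print(round(kurtosis_standard_error(100), 3))

# Variance is the squared standard deviation; small gaps from the table
# arise because the standard deviations are rounded to three decimals.
print(8.316 ** 2, 15.048 ** 2)
```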

Table 3

Statistics
	age	blood pressure
N Valid	100	100
N Missing	0	0
Mean	54.76	132.37
Median	56.00	130.00
Mode	44^a	130
Std. Deviation	8.316	15.048
Variance	69.154	226.437
Range	34	76
a. Multiple modes exist. The smallest value is shown

Table 3 above shows the measures of central tendency and dispersion of the quantitative variables. Age had a median of 56, a mode of 44, and a range of 34. The mode is the most frequently occurring value, not the age of the majority of subjects; several ages tie as modes, and 44 is the smallest of them. Blood pressure had a median of 130 and a mode of 130, so 130 was the most frequently recorded blood pressure.
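The footnote "Multiple modes exist. The smallest value is shown" can be illustrated with a short sketch. The ages below are hypothetical and chosen only so that two values tie as modes; they are not the study data.

```python
from statistics import median, multimode

# Hypothetical ages in which 44 and 51 each occur twice (not the study data).
ages = [44, 44, 51, 51, 56, 58, 60]

modes = multimode(ages)      # every value that ties for most frequent
reported_mode = min(modes)   # convention in Table 3: show the smallest mode
mode_share = ages.count(reported_mode) / len(ages)

print("modes:", modes)
print("reported mode:", reported_mode)
print("median:", median(ages))
print(f"share of subjects at the reported mode: {mode_share:.0%}")
```

Even the most frequent value can account for a small share of the sample, which is why a mode should not be read as describing the majority of subjects.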

Fig 1 and Fig 2 below show histograms of age and blood pressure, respectively. Neither variable appears markedly skewed, so both distributions look approximately normal. Blood pressure has two outlying values while age has none.

Fig 1 Age histogram

Fig 2 Blood Pressure histogram

Table 4

sex
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	Female	29	29.0	29.0	29.0
	Male	71	71.0	71.0	100.0
	Total	100	100.0	100.0

Table 4 above shows the distribution of subjects by gender, which is illustrated in Fig 3 below.

Fig 3 Distribution of subjects by gender

Fig 3 indicates that males were the majority at 71%, while females made up 29%.
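A frequency table like Table 4 can be built from raw category labels. The sketch below recreates the labels from the counts reported in Table 4 (29 female, 71 male); the variable names are assumptions for this example.

```python
from collections import Counter

# Labels recreated from the counts reported in Table 4.
sexes = ["Female"] * 29 + ["Male"] * 71

counts = Counter(sexes)
total = sum(counts.values())

rows = []
cumulative = 0.0
for category in sorted(counts):
    percent = 100 * counts[category] / total
    cumulative += percent  # running total gives the cumulative percent
    rows.append((category, counts[category], percent, cumulative))

for category, frequency, percent, cum in rows:
    print(f"{category}\t{frequency}\t{percent:.1f}\t{cum:.1f}")
```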

The inferential statistics discussed in this study include the Chi-Square test of independence, linear regression, and tests of means. Before embarking on inferential statistics, it is wise to test whether the data are normally distributed in order to decide whether to use parametric or non-parametric techniques in the analysis (Lee, n.d.).

The hypotheses for testing normality are as follows:

H₀: The data are normally distributed

H₁: The data are not normally distributed

Table 5

Tests of Normality
	Kolmogorov-Smirnov^a			Shapiro-Wilk
	Statistic	df	Sig.	Statistic	df	Sig.
blood pressure	.113	100	.423	.971	100	.526
a. Lilliefors Significance Correction

Table 5 above shows the results of the normality test for blood pressure. The Shapiro-Wilk p-value (0.526) is greater than the 0.05 level of significance. Therefore, we fail to reject the null hypothesis and conclude that the data are consistent with a normal distribution. The Shapiro-Wilk result is reported rather than the Kolmogorov-Smirnov result because of the sample size, which is greater than 25 (Machin, 2010).
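To show what the Kolmogorov-Smirnov "Statistic" column measures, the sketch below computes the largest gap between the empirical distribution of a sample and a normal distribution whose mean and standard deviation are estimated from that sample, which is the setting of the Lilliefors correction. It does not compute a p-value, and both data sets are synthetic and purely illustrative.

```python
from math import exp
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # parameters estimated from data
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Synthetic, illustrative data: evenly spaced normal quantiles, and a
# strongly right-skewed transformation of the same values.
near_normal = [NormalDist(130, 15).inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [exp(x / 15) for x in near_normal]

print(f"near-normal sample: D = {ks_statistic_normal(near_normal):.3f}")
print(f"skewed sample:      D = {ks_statistic_normal(skewed):.3f}")
```

A small statistic means the sample tracks the fitted normal curve closely; a large one signals a departure from normality.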

Chi-Square Test

The Chi-Square test is a statistical test of the association between two categorical variables (Paulk, 2012). The hypotheses used in testing for association are as follows:

H₀: Gender and fasting blood sugar level (whether it is less than 120) are independent; there is no significant association between them.

H₁: Gender and fasting blood sugar level are not independent; there is a significant association between them.

Table 6

sex * Fasting blood sugar <120 Crosstabulation
			Fasting blood sugar <120		Total
			False	True
sex	Female	Count	26	3	29
		% within sex	89.7%	10.3%	100.0%
		% within Fasting blood sugar <120	29.9%	23.1%	29.0%
		% of Total	26.0%	3.0%	29.0%
	Male	Count	61	10	71
		% within sex	85.9%	14.1%	100.0%
		% within Fasting blood sugar <120	70.1%	76.9%	71.0%
		% of Total	61.0%	10.0%	71.0%
Total		Count	87	13	100
		% within sex	87.0%	13.0%	100.0%
		% within Fasting blood sugar <120	100.0%	100.0%	100.0%
		% of Total	87.0%	13.0%	100.0%

Table 6 above indicates that most subjects of both sexes fall in the False category (89.7% of females and 85.9% of males), that is, their fasting blood sugar was not less than 120.

Chi-Square Test Table

Table 7

Chi-Square Tests
	Value	df	Asymp. Sig. (2-sided)	Exact Sig. (2-sided)	Exact Sig. (1-sided)
Pearson Chi-Square	.255^a	1	.614
Continuity Correction^b	.031	1	.860
Likelihood Ratio	.265	1	.607
Fisher’s Exact Test				.751	.444
N of Valid Cases	100
a. 1 cells (25.0%) have expected count less than 5. The minimum expected count is 3.77.
b. Computed only for a 2×2 table

Table 7 reports several tests; our interest is in the Pearson Chi-Square. Its value is 0.255 with a p-value of 0.614. Since the p-value is greater than the 0.05 level of significance, we fail to reject the null hypothesis and conclude that there is no statistically significant association between gender and whether the fasting blood sugar is less than 120 (Pons). Because one cell has an expected count below 5, Fisher’s Exact Test (two-sided p = 0.751) is a useful check, and it leads to the same conclusion.
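The Pearson Chi-Square and Fisher's Exact Test values in Table 7 can be recomputed from the four counts in Table 6. The sketch below uses only the standard library; the variable names are assumptions, and the p-value shortcut with `erfc` is valid only for one degree of freedom, as in a 2×2 table.

```python
from math import comb, erfc, sqrt

# Observed counts from Table 6. Rows: Female, Male. Columns: False, True.
observed = [[26, 3], [61, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
p_chi2 = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with df = 1
min_expected = min(min(row) for row in expected)


def table_probability(k):
    """Hypergeometric probability of k females in the True column."""
    return (comb(row_totals[0], k) * comb(row_totals[1], col_totals[1] - k)
            / comb(n, col_totals[1]))


k_obs = observed[0][1]
k_min = max(0, col_totals[1] - row_totals[1])
k_max = min(row_totals[0], col_totals[1])
p_obs = table_probability(k_obs)

# Two-sided: add up all tables no more likely than the observed one.
p_fisher_two_sided = sum(
    p for p in map(table_probability, range(k_min, k_max + 1))
    if p <= p_obs * (1 + 1e-9)
)
# One-sided: tables with as few or fewer females in the True column.
p_fisher_one_sided = sum(table_probability(k) for k in range(k_min, k_obs + 1))

print(f"chi-square = {chi2:.3f}, p = {p_chi2:.3f}, "
      f"minimum expected count = {min_expected:.2f}")
print(f"Fisher exact: two-sided p = {p_fisher_two_sided:.3f}, "
      f"one-sided p = {p_fisher_one_sided:.3f}")
```

Both tests give p-values well above 0.05, so the conclusion does not depend on which one is used.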

Table 8

Symmetric Measures
		Value	Approx. Sig.
Nominal by Nominal	Phi	.050	.614
	Cramer’s V	.050	.614
N of Valid Cases		100

Both Cramer’s V and Phi measure the strength of association between two variables. In Table 8, the strength of association between the two variables (0.050) is very weak.
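Cramer's V is obtained directly from the Pearson Chi-Square value. The sketch below applies the standard formula to the reported chi-square values; the function name is an assumption, and the 26-row shape used for the blood pressure by sex table is inferred from its 25 degrees of freedom rather than stated in the paper.

```python
from math import sqrt


def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramer's V from a Pearson chi-square statistic and the table shape."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Sex by fasting blood sugar: chi-square 0.255 on a 2 x 2 table, n = 100.
v_sugar = cramers_v(0.255, 100, 2, 2)

# Blood pressure by sex: chi-square 27.510, n = 100. The 26 x 2 shape is
# inferred from df = 25 and is an assumption of this sketch.
v_pressure = cramers_v(27.510, 100, 26, 2)

print(f"V (sex, fasting blood sugar) = {v_sugar:.3f}")
print(f"V (sex, blood pressure) = {v_pressure:.3f}")
```

For a table with two rows or two columns, Phi and Cramer's V coincide, which is why the two measures share the same value in the symmetric measures tables.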

Table 9

blood pressure * sex Crosstabulation
			sex		Total
			Female	Male
blood pressure	104	Count	0	1	1
		% within blood pressure	.0	1.0	1.0
		% within sex	.0	.0	.0
		% of Total	.0	.0	.0
	105	Count	1	0	1
		% within blood pressure	1.0	.0	1.0
		% within sex	.0	.0	.0
		% of Total	.0	.0	.0
	108	Count	1	0	1
		% within blood pressure	1.0	.0	1.0
		% within sex	.0	.0	.0
		% of Total	.0	.0	.0
	110	Count	0	7	7
		% within blood pressure	.0	1.0	1.0
		% within sex	.0	.1	.1
		% of Total	.0	.1	.1
	112	Count	0	2	2
		% within blood pressure	.0	1.0	1.0
		% within sex	.0	.0	.0
		% of Total	.0	.0	.0
	115	Count	0	1	1
		% within blood pressure	.0	1.0	1.0
		% within sex	.0	.0	.0
		% of Total	.0	.0	.0
Total	Count	29	71	100
% within blood pressure	.3	.7	1.0
% within sex	1.0	1.0	1.0
% of Total	.3	.7	1.0

Table 9 above cross-tabulates blood pressure by sex and suggests that blood pressure readings differ between males and females.

Chi-Square Tests Table

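The Chi-Square test below indicates that there is no statistically significant association between gender and blood pressure (p = 0.331). Because 92.3% of the cells have an expected count below 5, this result should be interpreted with caution.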

Table 10

Chi-Square Tests
	Value	df	Asymp. Sig. (2-sided)
Pearson Chi-Square	27.510^a	25	.331
Likelihood Ratio	33.753	25	.113
N of Valid Cases	100
a. 48 cells (92.3%) have expected count less than 5. The minimum expected count is .29.

Table 11

Symmetric Measures
		Value	Approx. Sig.
Nominal by Nominal	Phi	.525	.331
	Cramer’s V	.525	.331
N of Valid Cases		100

Table 11 appears to differ on the association between gender and blood pressure. The Cramer’s V value (0.525) would indicate a strong association between the two variables (Vogt, 2012), but it is not statistically significant (p = 0.331), and with so many sparse cells the value should be interpreted with caution.

Conclusion

The study above explored the relationship between gender, age, blood pressure levels, and the classification of subjects as either sick or healthy. Neither Chi-Square test found a statistically significant association with gender. However, more studies should be conducted in order to obtain substantial evidence on these relationships.

References

Aberson. (2010). Applied power analysis for the behavioral sciences. New York: Routledge Academic.

Agarwal, B. L. (n.d.). Basic Statistics.

Brase, C. H. (2013). Understanding basic statistics. Australia: Cole Cengage Learning.

Daniel, W. W. (2010). Biostatistics. Chichester: John Wiley.

Field, A. P. (2014). Discovering statistics using R. London: Sage.

Fraser. (2012). Business Statistics for competitive advantage with Excel 2010. New York: Springer.

Friedman, L. M. (2010). Fundamentals of clinical trials. New York: Springer.

Givens, G. H. (2013). Computational statistics. Hoboken: Wiley.

Knopov, P. S. (2012). Regression Analysis Under A Priori Parameter Restrictions. New York: Springer-Verlag.

Lee, E. T. (n.d.). Statistical methods for survival data analysis.

Machin, D. A. (2010). Randomized clinical trials. West Sussex: Wiley-Blackwell.