Examples And Applications Of Classification And Regression Analysis In Data Science

Handling Imbalanced Data in Classification Techniques

Imbalanced data is a common problem in classification; it arises when the classes are not of equal size (Hilbe 2009). One example is the daily attendance register for employees in an office: on 18 March 2018, 4 out of 200 individuals were absent and 196 were present. Another example is an organization in which four departments (A, B, C, and D) generate the same product. According to statistics collected by the organization, the overall product yield can be attributed 50% to A, 20% to B, 12% to C, and 18% to D. These two examples show how common imbalanced data is in the business world. Two techniques can be employed to address this problem: random under-sampling and re-sampling. Random under-sampling seeks to equalize the class distribution by randomly removing observations from the majority class. Re-sampling, by contrast, calls for increasing the minority classes or, alternatively, decreasing the majority classes in order to balance the classes (Hilbe 2009).
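The two balancing approaches can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the employee IDs are made up, the function names are hypothetical, and the re-sampling variant shown is the one that grows the minority class by random duplication.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly drop majority-class records until both classes are the same size."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))
    return kept, list(minority)

def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class records until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra

# Attendance example from the text: 196 present, 4 absent (IDs are made up).
present = [f"emp{i}" for i in range(196)]
absent = [f"emp{i}" for i in range(196, 200)]

under_major, under_minor = random_undersample(present, absent)
over_major, over_minor = random_oversample(present, absent)
print(len(under_major), len(under_minor))
print(len(over_major), len(over_minor))
```

Under-sampling discards most of the majority records, while over-sampling keeps every record but repeats the minority ones, so the choice depends on how much data can be spared.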

Question 2

In a logistic regression it is impossible to fit a single line that passes through all the points, because the response takes only the values 0 and 1. Since an R-squared value of 1 cannot be attained, the ordinary R-squared and adjusted R-squared are of little use in a logistic regression analysis. Special measures have therefore been developed to tackle this problem, for example McFadden’s pseudo-R-squared (Hilbe 2009).
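McFadden's pseudo-R-squared compares the log-likelihood of the fitted model with that of a null model that predicts the same probability for everyone: 1 - lnL(model) / lnL(null). The sketch below computes it from observed outcomes and predicted probabilities; the outcomes and probabilities are made-up illustrative values, not results from any model in this document.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p_model):
    """McFadden's pseudo-R-squared: 1 - lnL(model) / lnL(null)."""
    base_rate = sum(y) / len(y)  # the null model predicts the overall share of 1s
    ll_null = log_likelihood(y, [base_rate] * len(y))
    ll_model = log_likelihood(y, p_model)
    return 1.0 - ll_model / ll_null

# Made-up outcomes and fitted probabilities, for illustration only.
y = [1, 1, 1, 0, 0, 1, 0, 1]
p = [0.9, 0.8, 0.7, 0.2, 0.3, 0.6, 0.4, 0.9]
print(round(mcfadden_r2(y, p), 3))
```

A value of 0 means the model does no better than the null model, and values closer to 1 indicate a better fit.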

Question 3

Logistic regression can be used by business management because values of 0 and 1 can be assigned to data to distinguish the classes. For example, management can use 1s to denote employees who attended a meeting and 0s to indicate those who were absent. Another example of where logistic regression can be applied in a business setting is when information is being collected on which employees received bonuses. The individuals who received bonuses will be denoted by 1s and those who did not will be represented by 0s. Therefore, logistic regression is used in organizations to analyse qualitative data with fixed responses or choices that can be assigned numerical values.

Question 4

Each level of the first explanatory variable (X₁) will be assigned a unique numerical value, for example 0 for low, 1 for average, 2 for high, and 3 for very high. The same will be done for the second explanatory variable (X₂); its levels will be assigned values such as 0=Sydney, 1=Melbourne, and 2=Brisbane. It is easy to see that the same number allocation system can be used on different variables with unrelated data, given that X₁ deals with a ranking system and X₂ deals with Australian cities. Since we have two independent variables, there will be three coefficients, i.e. B₀, B₁, and B₂. It is important to indicate that B₁ and B₂ are the coefficients for X₁ and X₂ respectively.
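The coding scheme can be sketched as two lookup tables feeding the linear predictor B₀ + B₁X₁ + B₂X₂. The codes are those given above; the coefficient values and the function name are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Numerical codes taken from the text.
LEVEL_CODES = {"low": 0, "average": 1, "high": 2, "very high": 3}
CITY_CODES = {"Sydney": 0, "Melbourne": 1, "Brisbane": 2}

def linear_predictor(level, city, b0, b1, b2):
    """Return B0 + B1*X1 + B2*X2 for one observation after coding its levels."""
    x1 = LEVEL_CODES[level]
    x2 = CITY_CODES[city]
    return b0 + b1 * x1 + b2 * x2

# Hypothetical coefficients, for illustration only.
value = linear_predictor("high", "Melbourne", b0=1.0, b1=0.5, b2=-0.25)
print(value)
```

With these illustrative numbers the observation ("high", "Melbourne") is coded as X₁=2 and X₂=1, giving 1.0 + 0.5(2) - 0.25(1).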

Question 5

Part (a)

KNN models are based on a non-parametric learning technique through which we attempt to predict the value of a given variable based on a training set. The first step involves evaluating similarity through a distance function such as the Euclidean distance. The formula we will employ is

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)

where p and q are two observations described by n numerical attributes.

The second step deals with finding the K nearest neighbours. For instance, you take the five observations closest to the new customer and then choose the spending level that best suits that customer. We can also plot the distance values on a chart to see more easily which ones are closest to the new customer.
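The two steps (measure distances, then take the K nearest neighbours) can be sketched as follows. The customer records, the attribute choice (age and income), and the function names are hypothetical; a majority vote among the neighbours is assumed as the way of choosing the spending level, and the attributes are left unscaled for simplicity.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training, query, k=5):
    """Predict a label by majority vote among the k training rows nearest to query.

    training is a list of (features, label) pairs.
    """
    neighbours = sorted(training, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, income in $000) with a spending level.
training = [
    ((25, 40), "low"), ((30, 45), "low"), ((28, 42), "low"),
    ((45, 90), "high"), ((50, 95), "high"), ((48, 88), "high"), ((52, 99), "high"),
]
print(knn_predict(training, (47, 91), k=5))
```

In practice the attributes would usually be rescaled first so that one with a large numeric range does not dominate the distance.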

Part (b)

Yes, it will increase because we will be given a wider scope from which to choose the new customer's spending. Moreover, using the COUNTIF function in Microsoft Excel, it is clear that 100% of the customers in the data spent more than $500; as such, it is very likely that this new customer will spend at least $500.

Part (c)

The information on the type of product being purchased is omitted; as such, the first column will be ignored in our calculation. After the calculation, the new female customer had a distance of 0 from a pre-existing customer. Therefore, we conclude that the customer is most likely to spend $938.

Question 6

Part (a)

The best approach would be to analyse the data and compare the variables to discern their relationship. For instance, we can compare how many repairs of a given type were performed by a particular repair person. According to the data and chart below, the majority of the repairs were done by Bob and John. Most of the mechanical repair jobs were performed by John; as such, if constant mechanical issues are being witnessed with the machines, the bulk of the blame should be directed at John. It is therefore important for management to develop a plan that will ensure that John is assessed and trained on his competence as a mechanical repairer. On the other hand, the majority of the electrical repairs were performed by Bob; likewise, he should be held most accountable for repetitive electrical issues with the machines. It is important to note that James has done very little compared to the other two employees. It is recommended that he be moved to another department or be let go, because his productivity level is a quarter of that of the other two employees. Alternatively, he should be doing more jobs to ensure that they each perform an equal 30 repairs.

	Mechanical Jobs	Electrical Jobs	Total Jobs
James	4	6	10
John	23	17	40
Bob	17	23	40

Given that the data is randomly generated, we can only assess statistics like the mean through several computations. From the figures below it is clear that James takes the least time to perform a repair and that Bob takes the most time to perform a task. The table for months since last repair indicates that James has stayed the longest without being assigned a task, while Bob is the one who has been getting the majority of the repair jobs recently. As such, I would recommend that James be assigned more tasks than Bob to allow for efficiency and reduced downtime while the machine is being fixed. It can also be hypothesized that Bob repairs the machines in an incompetent manner and takes a long time while doing it.

	Repair Time
	James	John	Bob
Mean 1	8.07	9.1025	13.8375
Mean 2	8.76	8.965	14.0102564
Mean 3	9.06	9.1175	13.9657895
Mean 4	9.94	9.0075	14.0837838
Mean 5	9.28	9.365	13.8666667

	Months Since Last Repair
	James	John	Bob
Mean 1	6.5	6.2	4.8
Mean 2	6.4	6.2	4.8
Mean 3	6.5	6.1	4.8
Mean 4	7.1	6.0	4.8
Mean 5	7.1	6.0	4.8

Part (b)

Using the data, tables, and charts we have, it is clear that the number of repairs has increased in recent months. Assume that James did his last repair at least six months ago and that the other two repair persons have been responsible for all repairs since then. On that assumption, six months ago the company was getting good machines from suppliers, and they required very few repairs that took only a few hours to complete. However, in the last few months the number of repairs has increased significantly and the time required to complete them has also increased substantially. This suggests that the products they are now receiving from suppliers are defective and difficult to repair properly. As a result, an individual repair person now performs four times the number of jobs another individual would have done seven months ago.

Part (c)

The manager can add data on the period between one repair and the next repair needed by the same machine. Moreover, the manager can add the identities of the machines that were repaired so that one can track which products have received the bulk of the repairs. Another dataset that can be included is the name of the supplier who sold the business each particular machine. By so doing, the analysis will reveal which supplier has been providing ineffective products. The inclusion of this data will allow for more expansive analysis, such as logistic regression and correlation analysis.

Question 7

Part (a)

We can check the identity of the donors with missing blood type information and see whether the individual has donated before or after the specified date. If the person has donated on another occasion and the blood type is indicated, that value should be copied into the empty cell for the same individual, since blood type does not change. We can use a COUNTIF function to assess the frequency of each donor ID, and then use the find box to locate the other donation by the same donor. However, if the person has donated only once, a logistic regression using the available data can be performed to establish a relationship between the other variables (excluding date) and blood type. We will then use the logistic regression equation to estimate the blood type for each of the first-time donors.

Part (b)

I first extracted all unique donor IDs and copied them to another location on the same worksheet using the Advanced Filter feature found under the Data tab. I used the SUMIFS function to sum the total protein values for each unique donor. After this, I employed the COUNTIFS function to compute the frequency of each unique donor in the data. Lastly, I divided the sum of total protein for each unique donor by their respective frequency.
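The same sum-then-count-then-divide procedure can be sketched outside Excel. The donor IDs and protein values below are made up for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

def mean_protein_by_donor(records):
    """Average total protein per donor from (donor_id, protein) pairs."""
    totals = defaultdict(float)  # plays the role of SUMIFS
    counts = defaultdict(int)    # plays the role of COUNTIFS
    for donor_id, protein in records:
        totals[donor_id] += protein
        counts[donor_id] += 1
    return {donor_id: totals[donor_id] / counts[donor_id] for donor_id in totals}

# Made-up donations: (donor ID, total protein in g/dL).
records = [("D001", 7.0), ("D002", 6.5), ("D001", 7.5), ("D003", 8.0), ("D002", 6.0)]
averages = mean_protein_by_donor(records)
print(averages)
```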

Part (c)

We first construct a pivot table by selecting the data with Donor ID and Total Protein Level. We then go to the Insert tab and select PivotTable. We then assign Donor ID to Rows and Total Protein Level to Values. We click on the drop-down arrow of the sum field in the Values area, select Value Field Settings, and change the setting from Sum to Max. We copy the max values to a column labelled Maximum and label the adjacent column Minimum. We then go back to the pivot table, change the value field setting from Max to Min, and copy the new data to the Minimum column. The range is given by subtracting the minimum from the maximum for each donor.
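As a cross-check on the pivot-table steps, the per-donor range (maximum minus minimum) can be sketched directly. The donor IDs and values are made up for illustration, and the function name is hypothetical.

```python
def protein_range_by_donor(records):
    """Range (max - min) of total protein per donor from (donor_id, protein) pairs."""
    lowest, highest = {}, {}
    for donor_id, protein in records:
        lowest[donor_id] = min(protein, lowest.get(donor_id, protein))
        highest[donor_id] = max(protein, highest.get(donor_id, protein))
    return {donor_id: highest[donor_id] - lowest[donor_id] for donor_id in lowest}

# Made-up donations: (donor ID, total protein in g/dL).
donations = [("D001", 7.0), ("D002", 6.5), ("D001", 7.5), ("D003", 8.0), ("D002", 6.0)]
ranges = protein_range_by_donor(donations)
print(ranges)
```

A donor with a single donation has a range of 0, since the maximum and minimum coincide.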

Part (d)

Drawing an X-Y scatter plot of age against total protein level suggests that the concentration does not decline with age. In fact, it suggests that the concentration increases by a very small amount as one gets older.

Part (e)

	Age	Total Protein Level (g/dL)
Age	1
Total Protein Level (g/dL)	0.03925393	1
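The off-diagonal entry of the table is a correlation coefficient between age and total protein level; a value of about 0.04 indicates a very weak positive linear relationship. Assuming it is the Pearson coefficient (the kind produced by Excel's correlation tool), it can be computed as sketched below. The sample numbers are made up purely to exercise the function and are not the donor data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up ages and protein levels, for illustration only.
ages = [21, 34, 45, 52, 63]
protein = [7.1, 6.8, 7.4, 7.0, 7.3]
print(round(pearson_r(ages, protein), 3))
```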

Question 8

Part (a)

I would suggest a multiple linear regression model because there are two or more independent variables. Moreover, the numerical data is considerably large and the non-numerical data can be assigned values thereby creating dummy variables (Montgomery, Peck & Vining 2015).

Part (b)

From the results above, the regression equation can be written as follows, where y is the risk, x₁ is age, x₂ is weight, and x₃ is gender:

y=-30.4+0.8x₁+0.4x₂+5.11x₃

Interpreting the coefficients: holding the other variables constant, an increase of one year in age is associated with a 0.7999 (about 0.8) increase in the risk of getting diabetes. An increase of one kilogram in weight is associated with a 0.3902 (about 0.4) increase in risk, while being male is associated with a 5.1087 (about 5.11) increase. Lastly, if the values of all three variables are zero, a person would have a risk of -30.396 (about -30.4) of developing diabetes. Looking at the significance of F, we can tell that the model is significant at alpha levels of 0.01, 0.05, and 0.1. Moreover, the adjusted R-squared is very high, meaning that 80.11% of the variation in the dependent variable can be explained by the independent variables.

Part (c)

From the results above, the regression equation can be written as follows, where y is the risk, x₁ is age, x₂ is weight, x₃ is gender, and x₄ is lifestyle:

y=-29.64+0.79x₁+0.38x₂+4.64x₃+0.59x₄

Interpreting the coefficients: holding the other variables constant, an increase of one year in age is associated with a 0.7922 (about 0.79) increase in the risk of getting diabetes. An increase of one kilogram in weight is associated with a 0.3813 (about 0.38) increase in risk, while being male is associated with a 4.6404 (about 4.64) increase. The lifestyle variable (residing in the country or a big city) is associated with a 0.586 (about 0.59) increase in risk. Lastly, if the values of all four variables are zero, a person would have a risk of -29.6437 (about -29.64) of developing diabetes. Looking at the significance of F, we can tell that the model is significant at alpha levels of 0.01, 0.05, and 0.1. Moreover, the adjusted R-squared is very high, meaning that 79.21% of the variation in the dependent variable can be explained by the independent variables.

Part (d)

Model 1

y=-30.4+0.8x₁+0.4x₂+5.11x₃

y=-30.4+0.8(59)+0.4(72)+5.11(1)

y=50.71

Model 2

y=-29.64+0.79x₁+0.38x₂+4.64x₃+0.59x₄

y=-29.64+0.79(59)+0.38(72)+4.64(1)+0.59(0)

y=48.97

There is a difference of 1.74 between the two risk values. It is therefore advisable to use the model with more variables because it takes all the factors into consideration.
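The substitution for both models can be checked with a short sketch. The coefficients and inputs (age 59, weight 72, male coded as 1, lifestyle coded as 0) are taken from the equations above; the function names are hypothetical.

```python
def model_1(age, weight, male):
    """Risk from the three-variable model: y = -30.4 + 0.8x1 + 0.4x2 + 5.11x3."""
    return -30.4 + 0.8 * age + 0.4 * weight + 5.11 * male

def model_2(age, weight, male, lifestyle):
    """Risk from the four-variable model: y = -29.64 + 0.79x1 + 0.38x2 + 4.64x3 + 0.59x4."""
    return -29.64 + 0.79 * age + 0.38 * weight + 4.64 * male + 0.59 * lifestyle

risk_1 = model_1(59, 72, 1)
risk_2 = model_2(59, 72, 1, 0)
print(round(risk_1, 2), round(risk_2, 2), round(risk_1 - risk_2, 2))
```

Rounded to two decimals, the results agree with the hand calculation: 50.71, 48.97, and a difference of 1.74.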

Question 9

Part (a)

	Salary Increment	Annual Income	Percentage of Income Invested	Balance of Retirement Account	Return (5%)
		76,000	12%	9120	0
Year 1	3%	78280	12%	9849.6	729.6
Year 2	3%	80628.4	12%	10167.888	318.288
Year 3	3%	83047.252	12%	10474.06464	306.1766
Year 4	3%	85538.66956	12%	10788.34358	314.2789
Year 5	3%	88104.82965	12%	11111.99674	323.6532
Year 6	3%	90747.97454	12%	11445.35678	333.36
Year 7	3%	93470.41377	12%	11788.71749	343.3607
Year 8	3%	96274.52619	12%	12142.37902	353.6615
Year 9	3%	99162.76197	12%	12506.65039	364.2714
Year 10	3%	102137.6448	12%	12881.8499	375.1995
Year 11	3%	105201.7742	12%	13268.3054	386.4555
Year 12	3%	108357.8274	12%	13666.35456	398.0492
Year 13	3%	111608.5622	12%	14076.34519	409.9906
Year 14	3%	114956.8191	12%	14498.63555	422.2904
Year 15	3%	118405.5237	12%	14933.59462	434.9591
Year 16	3%	121957.6894	12%	15381.60246	448.0078
Year 17	3%	125616.4201	12%	15843.05053	461.4481
Year 18	3%	129384.9127	12%	16318.34204	475.2915
Year 19	3%	133266.46	12%	16807.89231	489.5503
Year 20	3%	137264.4538	12%	17312.12908	504.2368
Year 21	3%	141382.3874	12%	17831.49295	519.3639
Year 22	3%	145623.8591	12%	18366.43774	534.9448
Year 23	3%	149992.5748	12%	18917.43087	550.9931
Year 24	3%	154492.3521	12%	19484.95379	567.5229
Year 25	3%	159127.1227	12%	20069.50241	584.5486
Year 26	3%	163900.9363	12%	20671.58748	602.0851
Year 27	3%	168817.9644	12%	21291.7351	620.1476
Year 28	3%	173882.5034	12%	21930.48716	638.7521
Year 29	3%	179098.9785	12%	22588.40177	657.9146
Year 30	3%	184471.9478	12%	23266.05383	677.6521
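The figures in the table are consistent with a simple recurrence: income grows by 3% each year, and each row's account figure equals 12% of that year's income plus 5% of the previous row's figure. The sketch below reproduces that recurrence. It is inferred from the numbers, not from the worksheet itself, and the function and parameter names are hypothetical.

```python
def project_table(income=76_000.0, raise_rate=0.03, invest_rate=0.12,
                  return_rate=0.05, years=30):
    """Rebuild the table rows as (year, annual income, account figure)."""
    figure = invest_rate * income  # starting row: 12% of 76,000
    rows = [(0, income, figure)]
    for year in range(1, years + 1):
        income *= 1 + raise_rate
        # Inferred rule: this year's contribution plus 5% of the previous figure.
        figure = invest_rate * income + return_rate * figure
        rows.append((year, income, figure))
    return rows

rows = project_table()
for year, income, figure in (rows[1], rows[2], rows[30]):
    print(year, round(income, 2), round(figure, 2))
```

Under this rule only the 5% return on the previous figure is carried into the next row. A conventional cumulative balance would instead carry the whole previous balance forward, i.e. `(1 + return_rate) * figure`; which reading the worksheet intended is not stated.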

Part (b)

In the calculations above, we assumed that he would invest 12% of his annual salary for retirement purposes; as such, he will have a total of $23,266.05 after 30 years. To estimate how much he will need to invest to have $1,500,000 at the end of 30 years, we assume that instead of investing 12% he decides to invest 90% of his annual income. According to the calculations in the Excel worksheet, he will then have amassed a total of $174,495.40 within the specified period.

We can now use interpolation to find how much of his annual income he will need to invest to get savings of $1,500,000.

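The formula is as follows:

x = x2 + (y − y2)(x1 − x2) / (y1 − y2)

Where:

x1 = 90%

x2 = 12%

x = ?

y1 = $174,495.40

y2 = $23,266.05

y = $1,500,000

As such, we can solve for the value of x:

x = 12% + ($1,500,000 − $23,266.05)(90% − 12%) / ($174,495.40 − $23,266.05)

x = 773.66%

Hence, he would have to invest 773.66% of his salary.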
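The interpolation step can be verified with a short sketch. The two known points are the 12% and 90% investment rates with their 30-year totals of $23,266.05 and $174,495.40 quoted in this part; the function name is hypothetical.

```python
def linear_interpolate_x(x1, y1, x2, y2, y):
    """Solve for x on the straight line through (x1, y1) and (x2, y2) at height y."""
    return x2 + (y - y2) * (x1 - x2) / (y1 - y2)

# Rates are in percent; totals are the 30-year figures quoted in the text.
rate = linear_interpolate_x(x1=90.0, y1=174_495.40,
                            x2=12.0, y2=23_266.05,
                            y=1_500_000.0)
print(round(rate, 2))
```

Because the target of $1,500,000 lies far outside the two known totals, the line is being extended beyond the known points, which is why the resulting rate exceeds 100%.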

Question 10

Part A

	we will have 4 decision variables because website A’s “variable” will be split into two
	Decision Variables		Split Into
	Website A’s variables	X1	X11	X12	Hence X1=X11+X12
	Website B’s Variable	X2
	Website C’s Variable	X3

	Our Objective is to maximize the number of views from all three advertisement platforms
	Objective Function	max z=130,000(X11+X12)+35,000(X2)+80,000(X3)


	The objective function will be subject to the following constraints
1	X11+X12+X2+X3<=70
2	2,500(X11)+2,200(X12)+500(X2)+800(X3)<=68,000
3	X11+X12>=14
4	X11=5
5	X12<=10
6	X2<=49
7	X3<=35
8	X11,X12,X2,X3>=1

Hence, Website A will run 12 adverts, Website B will run 20 adverts, and Website C will run 35 adverts. The maximum total number of viewers expected is roughly 5,060,000.
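The reported totals can be checked by evaluating the objective function and the budget constraint at the stated allocation. The views and costs per advert come from the formulation above; the split of Website A's 12 adverts into X11=5 and X12=7 is an assumption based on constraint 4, and the function names are hypothetical. This sketch only evaluates a given allocation and does not solve the optimization.

```python
VIEWS_PER_ADVERT = {"A": 130_000, "B": 35_000, "C": 80_000}

def total_views(adverts):
    """Objective value: expected viewers for a mapping of website to advert count."""
    return sum(VIEWS_PER_ADVERT[site] * count for site, count in adverts.items())

def total_cost(x11, x12, x2, x3):
    """Left-hand side of the budget constraint (constraint 2)."""
    return 2_500 * x11 + 2_200 * x12 + 500 * x2 + 800 * x3

views = total_views({"A": 12, "B": 20, "C": 35})
cost = total_cost(x11=5, x12=7, x2=20, x3=35)  # assumed split of A's 12 adverts
print(views, cost)
```

The objective evaluates to 5,060,000 viewers, and the assumed split costs $65,900, which is within the $68,000 budget.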

Part B

If they want to run 11 adverts on Website A and still maximize the total number of views, we have to change a constraint and state that X11+X12>=11. By adjusting the charge for any adverts above 5 (a trial-and-error technique), we see that they will have to charge the figures below, which gives a maximum total of 4,930,000 viewers.

Cost per advertisement for the first 5 advertisements	$2,500
Cost per advertisement for more than 5 advertisements	$2,700
Total Cost of Website A adverts	$5,200