Imbalanced data is a common problem in classification; it arises when the classes are not of equal size (Hilbe 2009). One example is the daily attendance register for employees in an office: out of 200 individuals, 4 were absent and 196 were present on 18 March 2018. Another example is an organization with four departments (A, B, C, and D) that produce the same product. According to the statistics collected by the organization, the overall product yield can be attributed 50% to A, 20% to B, 12% to C, and 18% to D. These two examples show how common imbalanced data is in the world of business. Two techniques can be employed to address this problem: random under-sampling and re-sampling. Random under-sampling seeks to equalize the class distribution by randomly removing observations from the majority class. Re-sampling, by contrast, balances the classes either by increasing the number of observations in the minority classes or by decreasing the number in the majority classes (Hilbe 2009).
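As an illustration of random under-sampling, the sketch below balances the attendance example (196 present, 4 absent) by randomly keeping only as many records per class as the smallest class contains. The function name `random_under_sample` and the fixed seed are assumptions made for this example, not part of any particular library.

```python
import random

def random_under_sample(records, label_of, seed=0):
    """Randomly drop records until every class is as small as the smallest class."""
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # sample without replacement from each class
        balanced.extend(rng.sample(group, target))
    return balanced

# Attendance example from the text: 196 present, 4 absent.
attendance = ["present"] * 196 + ["absent"] * 4
balanced = random_under_sample(attendance, label_of=lambda r: r)
counts = {label: balanced.count(label) for label in set(balanced)}
print(counts)
```

The price of this approach is that most of the majority-class records are discarded, which is why increasing the minority class is sometimes preferred when the data set is small.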
Question 2
In a logistic regression the response takes only the values 0 and 1, so it is impossible to fit a single line of best fit that passes through all the points. Because an R-squared of 1 can therefore never be attained, the ordinary R-squared and adjusted R-squared are of little use in a logistic regression analysis. Special measures have been developed to address this problem; for example, McFadden’s pseudo-R-squared (Hilbe 2009).
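McFadden’s pseudo-R-squared compares the log-likelihood of the fitted model with that of a model containing only an intercept: 1 - LL(model) / LL(null). The sketch below computes it from observed 0/1 outcomes and predicted probabilities. The outcome and probability values are hypothetical and serve only to show the calculation.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for y, p in zip(y_true, p_pred))

def mcfadden_r2(y_true, p_model):
    """McFadden's pseudo-R-squared: 1 - LL(model) / LL(intercept-only model)."""
    base_rate = sum(y_true) / len(y_true)  # the null model predicts the overall rate
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    ll_model = log_likelihood(y_true, p_model)
    return 1 - ll_model / ll_null

# Hypothetical outcomes and fitted probabilities, for illustration only.
y = [1, 0, 1, 1, 0, 0]
p = [0.8, 0.3, 0.7, 0.9, 0.2, 0.4]
r2 = mcfadden_r2(y, p)
print(round(r2, 3))
```

A model that predicts no better than the overall rate scores 0, and the value approaches 1 as the predicted probabilities approach the observed outcomes.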
Question 3
Logistic regression can be used by business management because values of 0 and 1 can be assigned to the data to distinguish the classes. For example, management can use 1s to denote employees who attended a meeting and 0s to indicate those who were absent from the meeting. Another example of where logistic regression can be applied in a business setting is when information is being collected on which employees received bonuses. The individuals who received bonuses will be denoted by 1s and those who did not will be represented by 0s. Therefore, logistic regression is used in organizations to analyse qualitative data with fixed responses or choices that can be assigned numerical values.
Question 4
Each of the levels of the first explanatory variable (X1) will be assigned a unique numerical value. For example, 0 for low, 1 for average, 2 for high, and 3 for very high. Likewise, the same will be done for the second explanatory variable (X2); as such, its levels will be assigned values like 0=Sydney, 1=Melbourne, and 2=Brisbane. It is easy to see that the same number allocation system can be used on different variables with unrelated data, given that X1 deals with a ranking system and X2 deals with Australian cities. Since we have two independent variables, there will be three coefficients, i.e. B0, B1, and B2, where B0 is the intercept and B1 and B2 are the coefficients for X1 and X2 respectively.
Question 5
Part (a)
KNN models are based on a non-parametric learning technique through which we attempt to predict the value of a given variable based on a training set. The first step involves evaluating similarity through a distance function such as the Euclidean distance. The formula we will employ is $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$, where p and q are two customers described by n numerical attributes.
The second step deals with finding the K nearest neighbours. For instance, you take the five customers closest to the new customer and then choose the spending level that best suits the customer in question. We can also chart a graph of the distance values to see more easily which ones are closest to the new customer.
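The two steps described above can be sketched as follows: compute the Euclidean distance from the new customer to every customer in the training set, keep the K closest, and take the most common spending level among them. The customer attributes (age and income) and the spending labels are hypothetical, since the original data set is not reproduced here.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training, new_point, k=5):
    """Return the most common label among the k training points nearest to new_point."""
    neighbours = sorted(training, key=lambda row: euclidean(row[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, income in $1000s) -> spending level.
training = [
    ((25, 40), "low"), ((30, 45), "low"), ((28, 42), "low"),
    ((45, 90), "high"), ((50, 95), "high"), ((48, 88), "high"), ((52, 100), "high"),
]
prediction = knn_predict(training, (47, 92), k=5)
print(prediction)
```

In practice the attributes are usually rescaled before distances are computed so that a variable measured in large units does not dominate; that step is left out here to keep the sketch short.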
Part (b)
Yes, it will increase because we will be given a wider scope from which to choose the new customer's spending. Moreover, using the COUNTIF function in Microsoft Excel it is clear that 100% of the customers in the data spent more than $500; as such, it is very likely that this new customer will spend at least $500.
Part (c)
The information on the type of product being purchased is omitted; as such, the first column will be ignored in our calculation. After the calculation, the new female customer had a distance of 0 from a pre-existing customer. Therefore, we conclude that the customer is most likely to spend $938.
Question 6
Part (a)
The best approach would be to analyse the data and compare the variables to discern their relationships. For instance, we can compare how many repairs of a given type were performed by a particular repair person. According to the data and chart below, the majority of the repairs were done by Bob and John. The majority of the mechanical repair jobs were performed by John; as such, if constant mechanical issues are being witnessed with the machines, the bulk of the blame should be directed at John. It is therefore important for management to develop a plan that ensures John is assessed and trained on his competence as a mechanical repairer. On the other hand, the majority of the electrical repairs were performed by Bob; likewise, he should be held most accountable for repetitive electrical issues with the machines. It is important to note that James has done very little compared to the other two employees. It is recommended that he be moved to another department or be let go, because his productivity level is a quarter of that of each of the other two employees. Otherwise, he should be doing more jobs to ensure that they each perform an equal 30 repairs.
| Repair Person | Mechanical Jobs | Electrical Jobs | Total Jobs |
|---|---|---|---|
| James | 4 | 6 | 10 |
| John | 23 | 17 | 40 |
| Bob | 17 | 23 | 40 |
Given that the data is randomly generated, we can only assess statistics such as the mean through several computations. From the figures below it is clear that James takes the least time to perform a repair, and it is evident that Bob takes the most time to perform a task. The table for months since last repair indicates that James has gone the longest without being assigned a task, while Bob is the one who has been getting the majority of the repair jobs recently. As such, I would recommend that James be assigned more tasks than Bob to allow for efficiency and reduced downtime while the machine is being fixed. It can also be hypothesized that Bob repairs the machines in an incompetent manner and takes a long time while doing it.
Repair Time

| | James | John | Bob |
|---|---|---|---|
| Mean 1 | 8.07 | 9.1025 | 13.8375 |
| Mean 2 | 8.76 | 8.965 | 14.0102564 |
| Mean 3 | 9.06 | 9.1175 | 13.9657895 |
| Mean 4 | 9.94 | 9.0075 | 14.0837838 |
| Mean 5 | 9.28 | 9.365 | 13.8666667 |
Months Since Last Repair

| | James | John | Bob |
|---|---|---|---|
| Mean 1 | 6.5 | 6.2 | 4.8 |
| Mean 2 | 6.4 | 6.2 | 4.8 |
| Mean 3 | 6.5 | 6.1 | 4.8 |
| Mean 4 | 7.1 | 6.0 | 4.8 |
| Mean 5 | 7.1 | 6.0 | 4.8 |
Part (b)
Using the data, tables, and charts we have, it is clear that the number of repairs has increased in recent months. Assume that James last did a repair at least six months ago and that the other two repair persons have been responsible for all repairs since then. It would follow that six months ago the company was getting good machines from suppliers, which required very few repairs that took only a few hours to complete. However, in the last few months the number of repairs has increased significantly and the time required to complete them has also increased substantially. This suggests that the products now being received from suppliers are defective and difficult to repair properly. As a result, an individual repair person is now performing four times the number of jobs another individual would have done seven months ago.
Part (c)
The manager can add data on the period between one repair and the next repair needed for the same machine. Moreover, the manager can add the identities of the machines that were repaired so that one can track which machines have received the bulk of the repairs. Another item that can be included is the name of the supplier who sold the business each machine. By so doing, the analysis will reveal which supplier has been providing defective products. The inclusion of this data will allow for more expansive analysis such as logistic regression and correlation analysis.
Question 7
Part (a)
We can check the identity of the donors with missing blood type information and see whether the individual has donated before or after the specified date. If the person has donated on another occasion and the blood type is indicated, that value should be copied into the empty cell for the same individual, since blood type does not change. We can use a COUNTIF function to assess the frequency of each donor ID, and then use the find box to locate the other donation by the same donor. However, if the person has only donated once, a logistic regression using the available data can be performed to establish a relationship between the other variables (excluding date) and blood type. We will then use the logistic regression equation to estimate the blood type for each of the first-time donors.
Part (b)
I first extracted all unique donor IDs and copied them to another location on the same worksheet using the advanced filter feature found under the Data tab. I used the SUMIFS function to sum the total protein values for each unique donor. After this, I employed the COUNTIFS function to compute the frequency of each unique donor in the data. Lastly, I divided the sum of total protein for each unique donor by their respective frequency.
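The same sum-then-divide logic can be sketched outside Excel. The example below accumulates a running total and a count per donor ID, mirroring SUMIFS and COUNTIFS, and divides one by the other. The donor IDs and protein values are hypothetical.

```python
from collections import defaultdict

def average_protein_by_donor(rows):
    """Average total protein per donor: per-donor sum divided by per-donor count."""
    totals = defaultdict(float)   # plays the role of SUMIFS
    counts = defaultdict(int)     # plays the role of COUNTIFS
    for donor_id, protein in rows:
        totals[donor_id] += protein
        counts[donor_id] += 1
    return {donor: totals[donor] / counts[donor] for donor in totals}

# Hypothetical donations: (donor ID, total protein in g/dL).
donations = [("D001", 7.0), ("D002", 6.5), ("D001", 7.4), ("D003", 6.8), ("D002", 6.9)]
averages = average_protein_by_donor(donations)
for donor, value in sorted(averages.items()):
    print(donor, round(value, 2))
```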
Part (c)
We first construct a pivot table by selecting the data with Donor ID and Total Protein Level. We then go to the Insert tab and select PivotTable. We then assign Donor ID to Rows and Total Protein Level to Values. We click on the Sum of Scores drop-down arrow, select Value Field Settings, and change the setting from Sum to Max. Copy the max values to a column labelled maximum and label the adjacent column minimum. Go back to the pivot table, change the value field setting from Max to Min, and copy the new values to the minimum column. The range is given by subtracting the minimum from the maximum for each donor.
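As a sketch of the same calculation in code, the function below tracks the minimum and maximum protein level for each donor and returns their difference, which is what the two pivot-table passes produce. The donor IDs and values are hypothetical.

```python
def protein_range_by_donor(rows):
    """Range (max minus min) of total protein level for each donor."""
    lowest, highest = {}, {}
    for donor_id, protein in rows:
        lowest[donor_id] = min(protein, lowest.get(donor_id, protein))
        highest[donor_id] = max(protein, highest.get(donor_id, protein))
    return {donor: highest[donor] - lowest[donor] for donor in lowest}

# Hypothetical donations: (donor ID, total protein in g/dL).
sample = [("D001", 7.0), ("D002", 6.5), ("D001", 7.4), ("D003", 6.8), ("D002", 6.9)]
ranges = protein_range_by_donor(sample)
for donor, value in sorted(ranges.items()):
    print(donor, round(value, 2))
```

A donor with a single donation has a range of zero, because the minimum and maximum coincide.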
Part (d)
Drawing an X-Y scatter plot of age against total protein level suggests that the concentration does not decline with age. In fact, it suggests that the level increases by a very small amount as one gets older.
Part (e)
| | Age | Total Protein Level (g/dL) |
|---|---|---|
| Age | 1 | |
| Total Protein Level (g/dL) | 0.03925393 | 1 |
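The off-diagonal entry in a correlation matrix like the one above is a Pearson correlation coefficient. The sketch below shows how such a coefficient is computed from paired observations; the ages and protein levels are hypothetical and are not the values behind the reported 0.039.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical donors: age in years and total protein level in g/dL.
ages = [22, 35, 41, 56, 63]
protein = [7.1, 6.8, 7.3, 6.9, 7.2]
print(round(pearson_r(ages, protein), 3))
```

A value close to zero, as in the table, indicates that there is almost no linear relationship between the two variables.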
Question 8
Part (a)
I would suggest a multiple linear regression model because there are two or more independent variables. Moreover, the numerical data set is considerably large and the non-numerical data can be assigned values, thereby creating dummy variables (Montgomery, Peck & Vining 2015).
Part (b)
From the results above the regression equation can be written as follows, where Y is risk, X1 is age, X2 is weight, and X3 is gender:
y=-30.4+0.8x1+0.4x2+5.11x3
Interpreting the coefficients: holding the other variables constant, an increase of one year in age causes a 0.7999 (approximately 0.8) increase in the risk of getting diabetes. An increase of one kilogram in weight causes a 0.3902 (approximately 0.4) increase in the risk, while being male causes a 5.1087 (approximately 5.11) increase. Lastly, if the values of all three variables are equal to zero, then a person would have a risk of -30.396 (approximately -30.4) of developing diabetes. Looking at the significance of F, we can tell that the model is significant at alpha levels of 0.01, 0.05, and 0.1. Moreover, the adjusted R-squared is very high, meaning that 80.11% of the variation in the dependent variable can be explained by the independent variables.
Part (c)
From the results above the regression equation can be written as follows, where Y is risk, X1 is age, X2 is weight, X3 is gender, and X4 is lifestyle:
y=-29.64+0.79x1+0.38x2+4.64x3+0.59x4
Interpreting the coefficients: holding the other variables constant, an increase of one year in age causes a 0.7922 (approximately 0.79) increase in the risk of getting diabetes. An increase of one kilogram in weight causes a 0.3813 (approximately 0.38) increase in the risk, while being male causes a 4.6404 (approximately 4.64) increase. Residing in the country or a big city increases the risk by 0.586 (approximately 0.59). Lastly, if the values of all four variables are equal to zero, then a person would have a risk of -29.6437 (approximately -29.64) of developing diabetes. Looking at the significance of F, we can tell that the model is significant at alpha levels of 0.01, 0.05, and 0.1. Moreover, the adjusted R-squared is very high, meaning that 79.21% of the variation in the dependent variable can be explained by the independent variables.
Part (d)
Model 1
y=-30.4+0.8x1+0.4x2+5.11x3
y=-30.4+0.8(59)+0.4(72)+5.11(1)
y=50.71
Model 2
y=-29.64+0.79x1+0.38x2+4.64x3+0.59x4
y=-29.64+0.79(59)+0.38(72)+4.64(1)+0.59(0)
y=48.97
The two risk values differ by 1.74. It is therefore advisable to use the model with more variables because it takes all of the factors into consideration.
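The substitution above can be checked with a short sketch that evaluates both fitted equations for the same person (age 59, weight 72, male, lifestyle coded 0). The helper name `predict_risk` is an assumption made for this example; the coefficients are the rounded values used in the equations above.

```python
def predict_risk(coefficients, values):
    """Evaluate a linear regression equation: intercept + sum(b_i * x_i)."""
    intercept, *slopes = coefficients
    return intercept + sum(b * x for b, x in zip(slopes, values))

# Rounded coefficients from the two equations: [intercept, age, weight, gender, (lifestyle)].
model_1 = [-30.4, 0.8, 0.4, 5.11]
model_2 = [-29.64, 0.79, 0.38, 4.64, 0.59]

risk_1 = predict_risk(model_1, [59, 72, 1])
risk_2 = predict_risk(model_2, [59, 72, 1, 0])
difference = risk_1 - risk_2

print(round(risk_1, 2), round(risk_2, 2), round(difference, 2))
```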
Question 9
Part (a)
| Year | Salary Increment | Annual Income | Percentage of Income Invested | Balance of Retirement Account | Return (5%) |
|---|---|---|---|---|---|
| | | 76,000 | 12% | 9120 | 0 |
| Year 1 | 3% | 78280 | 12% | 9849.6 | 729.6 |
| Year 2 | 3% | 80628.4 | 12% | 10167.888 | 318.288 |
| Year 3 | 3% | 83047.252 | 12% | 10474.06464 | 306.1766 |
| Year 4 | 3% | 85538.66956 | 12% | 10788.34358 | 314.2789 |
| Year 5 | 3% | 88104.82965 | 12% | 11111.99674 | 323.6532 |
| Year 6 | 3% | 90747.97454 | 12% | 11445.35678 | 333.36 |
| Year 7 | 3% | 93470.41377 | 12% | 11788.71749 | 343.3607 |
| Year 8 | 3% | 96274.52619 | 12% | 12142.37902 | 353.6615 |
| Year 9 | 3% | 99162.76197 | 12% | 12506.65039 | 364.2714 |
| Year 10 | 3% | 102137.6448 | 12% | 12881.8499 | 375.1995 |
| Year 11 | 3% | 105201.7742 | 12% | 13268.3054 | 386.4555 |
| Year 12 | 3% | 108357.8274 | 12% | 13666.35456 | 398.0492 |
| Year 13 | 3% | 111608.5622 | 12% | 14076.34519 | 409.9906 |
| Year 14 | 3% | 114956.8191 | 12% | 14498.63555 | 422.2904 |
| Year 15 | 3% | 118405.5237 | 12% | 14933.59462 | 434.9591 |
| Year 16 | 3% | 121957.6894 | 12% | 15381.60246 | 448.0078 |
| Year 17 | 3% | 125616.4201 | 12% | 15843.05053 | 461.4481 |
| Year 18 | 3% | 129384.9127 | 12% | 16318.34204 | 475.2915 |
| Year 19 | 3% | 133266.46 | 12% | 16807.89231 | 489.5503 |
| Year 20 | 3% | 137264.4538 | 12% | 17312.12908 | 504.2368 |
| Year 21 | 3% | 141382.3874 | 12% | 17831.49295 | 519.3639 |
| Year 22 | 3% | 145623.8591 | 12% | 18366.43774 | 534.9448 |
| Year 23 | 3% | 149992.5748 | 12% | 18917.43087 | 550.9931 |
| Year 24 | 3% | 154492.3521 | 12% | 19484.95379 | 567.5229 |
| Year 25 | 3% | 159127.1227 | 12% | 20069.50241 | 584.5486 |
| Year 26 | 3% | 163900.9363 | 12% | 20671.58748 | 602.0851 |
| Year 27 | 3% | 168817.9644 | 12% | 21291.7351 | 620.1476 |
| Year 28 | 3% | 173882.5034 | 12% | 21930.48716 | 638.7521 |
| Year 29 | 3% | 179098.9785 | 12% | 22588.40177 | 657.9146 |
| Year 30 | 3% | 184471.9478 | 12% | 23266.05383 | 677.6521 |
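The balance figures above appear to follow the recurrence balance(t) = 12% × income(t) + 5% × balance(t-1), with income growing by 3% each year. The sketch below reproduces the table under that reading, which is inferred from the numbers rather than stated in the worksheet. For comparison, `carried_forward_balance` models the more conventional assumption that the whole balance stays invested and earns 5% each year; that is not what the table computes. Both function names are assumptions made for this example.

```python
def table_balances(start_income=76_000, raise_rate=0.03,
                   invest_rate=0.12, return_rate=0.05, years=30):
    """Reproduce the table: new balance = this year's contribution + 5% of last balance."""
    income = start_income
    balance = invest_rate * income            # starting row: 12% of 76,000
    rows = [(0, income, balance)]
    for year in range(1, years + 1):
        income *= 1 + raise_rate              # 3% salary increment
        balance = invest_rate * income + return_rate * balance
        rows.append((year, income, balance))
    return rows

def carried_forward_balance(start_income=76_000, raise_rate=0.03,
                            invest_rate=0.12, return_rate=0.05, years=30):
    """Alternative assumption: the whole balance is carried forward and earns the return."""
    income = start_income
    balance = invest_rate * income
    for _ in range(years):
        income *= 1 + raise_rate
        balance = balance * (1 + return_rate) + invest_rate * income
    return balance

rows = table_balances()
year, income, balance = rows[30]
print(year, round(income, 2), round(balance, 2))
print(round(carried_forward_balance(), 2))
```

The two assumptions diverge sharply over 30 years, so the choice between them determines how large a contribution rate is needed to reach a savings target.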
Part (b)
In the calculations above, we assumed that he would invest 12% of his annual salary for retirement purposes; as such, he will have a total of $23,266.05 after 30 years. To estimate how much he will need to invest in order to have $1,500,000 at the end of 30 years, we assume that instead of investing 12% he decides to invest 90% of his annual income. According to the calculations in the Excel worksheet, he will then have amassed a total of $174,495.40 within the specified period.
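We can now use linear interpolation to find the percentage of his annual income he will need to invest to reach savings of $1,500,000.
The formula is as follows:
$$x = x_1 + \frac{(y - y_1)(x_2 - x_1)}{y_2 - y_1}$$
Where:
x1=90%
x2=12%
x=?
y1=$174,495.40
y2=$23,266.05
y=$1,500,000
As such, we can solve for the value of x;
x=90+(1,500,000-174,495.40)(12-90)/(23,266.05-174,495.40)
x=773.66%
Hence, he would have to invest 773.66% of his salary.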
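The interpolation step can be reproduced with a few lines of code. The sketch below uses the two points taken from the worksheet results quoted earlier (12% gives $23,266.05 and 90% gives $174,495.40) and solves the straight line through them for the target of $1,500,000. The function name `interpolate_x` is an assumption made for this example.

```python
def interpolate_x(x1, y1, x2, y2, y):
    """Solve the straight line through (x1, y1) and (x2, y2) for the x that gives y."""
    return x1 + (y - y1) * (x2 - x1) / (y2 - y1)

# Points from the worksheet: percentage invested -> balance after 30 years.
required_percent = interpolate_x(90, 174_495.40, 12, 23_266.05, 1_500_000)
print(round(required_percent, 2))
```

Because the target lies well outside the two known points, the calculation is strictly an extrapolation along the same straight line.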
Question 10
Part (a)
We will have 4 decision variables because Website A’s “variable” will be split into two.

| Website | Decision Variable | Split Into |
|---|---|---|
| Website A | X1 | X11 and X12, hence X1=X11+X12 |
| Website B | X2 | |
| Website C | X3 | |

Our objective is to maximize the number of views from all three advertisement platforms.

Objective function:
max z=130,000(X11+X12)+35,000(X2)+80,000(X3)

The objective function will be subject to the following constraints:
1. X11+X12+X2+X3<=70
2. 2,500(X11)+2,200(X12)+500(X2)+800(X3)<=68,000
3. X11+X12>=14
4. X11=5
5. X12<=10
6. X2<=49
7. X3<=35
9. X11,X12,X2,X3>=1
Hence Website A will run 12 adverts, Website B will run 20 adverts, and Website C will run 35 adverts. The maximum total number of viewers expected is roughly 5,060,000.
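The reported viewer total can be verified by substituting the plan into the objective function. The sketch below does that and also totals the cost and the number of adverts. It assumes that the 12 adverts on Website A are split into 5 at the first rate and 7 at the second rate; that split is an assumption for illustration, since the text reports only the total of 12.

```python
VIEWS_PER_ADVERT = {"A": 130_000, "B": 35_000, "C": 80_000}

def total_views(a_first, a_extra, b, c):
    """Objective function: z = 130,000(X11 + X12) + 35,000(X2) + 80,000(X3)."""
    return (VIEWS_PER_ADVERT["A"] * (a_first + a_extra)
            + VIEWS_PER_ADVERT["B"] * b
            + VIEWS_PER_ADVERT["C"] * c)

def total_cost(a_first, a_extra, b, c):
    """Left-hand side of the budget constraint (constraint 2)."""
    return 2_500 * a_first + 2_200 * a_extra + 500 * b + 800 * c

# Reported plan: 12 adverts on A (assumed split 5 + 7), 20 on B, 35 on C.
plan = (5, 7, 20, 35)
views = total_views(*plan)
cost = total_cost(*plan)
adverts = sum(plan)
print(views, cost, adverts)
```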
Part (b)
If they want to run 11 adverts on Website A and still maximize the total number of views, we have to change a constraint so that it states X11+X12>=11. By adjusting the charge for any adverts above 5 (a trial and error technique), we see that they will have to charge the figure below, which maximizes the total number of viewers at 4,930,000.
| Item | Amount |
|---|---|
| Cost per advertisement for the first 5 advertisements | $2,500 |
| Cost per advertisement for more than 5 advertisements | $2,700 |
| Total Cost of Website A adverts | $5,200 |
References
Hilbe, JM 2009, Logistic Regression Models, CRC Press, Florida.
Montgomery, DC, Peck, EA & Vining, GG 2015, Introduction to Linear Regression Analysis, 4th edn, John Wiley & Sons, Hoboken.