A common problem with very large data sets is class imbalance: the event of interest occurs far less often than normal events. Suppose we are interested in fraudulent transactions. These are far rarer than healthy transactions; only about 1-5% of transactions may be fraudulent. This creates two classes: a majority class of healthy transactions and a minority class of fraudulent transactions.
Data preprocessing is carried out before the training data is passed on for analysis. There are two main approaches to handling imbalanced data.
Over-fitting: Over-fitting is a modeling error that occurs when a model fits the training data too closely, capturing its noise as well as the underlying pattern. It may be caused by including too many independent variables or by using an overly complex model. An over-fitted model tends to make larger prediction errors on new data.
Over-fitting can be avoided by using the following:
Examples where Logistic Regression is used:
We cannot use linear regression in this case because its very first assumption is not satisfied. Linear regression assumes that the response variable follows a normal distribution, whereas here the response variable is binary. In the examples above, the response variable takes only the values 0 and 1.
We referred to Draper and Smith (2014).
SECTION B: QUANTITATIVE QUESTIONS
| Cluster | Types of Products | Purchase amount ($) | Age | Gender | Marital Status | Membership | Discount card |
|---|---|---|---|---|---|---|---|
| Cluster-1 | 2.35 | 1071.19 | 56.12 | 0.49 | 0.86 | 0.49 | 0.00 |
| Cluster-2 | 2.61 | 1110.00 | 47.34 | 0.28 | 0.98 | 0.71 | 0.99 |
| Cluster-3 | 2.44 | 1210.38 | 34.06 | 0.63 | 0.01 | 0.33 | 0.49 |
| Cluster-4 | 4.00 | 758.98 | 43.34 | 0.94 | 0.67 | 0.34 | 0.70 |
| Cluster-5 | 4.31 | 1224.08 | 62.15 | 0.59 | 0.23 | 0.63 | 0.42 |
| Cluster-6 | 3.62 | 743.44 | 41.01 | 0.08 | 0.21 | 0.55 | 0.33 |
Cluster-1 has the lowest average number of product types and the lowest average discount card value.
Cluster-2 has the highest average discount card, marital status and membership values.
Cluster-3 has the lowest average membership, marital status and age.
Cluster-4 has the highest average gender value.
Cluster-5 has the highest average number of product types, purchase amount ($) and age.
Cluster-6 has the lowest average gender value and purchase amount ($).
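The cluster comparison above can be checked mechanically. The following is a minimal sketch that copies the cluster means from the table and reports, for each variable, which cluster has the lowest and the highest mean. The shortened column names and the helper `extremes` are illustrative choices, not part of the original analysis.

```python
# Cluster means copied from the table above; the feature labels are shortened
# illustrative names, not identifiers from the original workbook.
FEATURES = ["types_of_products", "purchase_amount", "age", "gender",
            "marital_status", "membership", "discount_card"]

CLUSTER_MEANS = {
    "Cluster-1": [2.35, 1071.19, 56.12, 0.49, 0.86, 0.49, 0.00],
    "Cluster-2": [2.61, 1110.00, 47.34, 0.28, 0.98, 0.71, 0.99],
    "Cluster-3": [2.44, 1210.38, 34.06, 0.63, 0.01, 0.33, 0.49],
    "Cluster-4": [4.00, 758.98, 43.34, 0.94, 0.67, 0.34, 0.70],
    "Cluster-5": [4.31, 1224.08, 62.15, 0.59, 0.23, 0.63, 0.42],
    "Cluster-6": [3.62, 743.44, 41.01, 0.08, 0.21, 0.55, 0.33],
}


def extremes(means, features):
    """Return {feature: (cluster with lowest mean, cluster with highest mean)}."""
    result = {}
    for i, feature in enumerate(features):
        lowest = min(means, key=lambda c: means[c][i])
        highest = max(means, key=lambda c: means[c][i])
        result[feature] = (lowest, highest)
    return result


summary = extremes(CLUSTER_MEANS, FEATURES)
for feature, (lowest, highest) in summary.items():
    print(f"{feature}: lowest in {lowest}, highest in {highest}")
```

The printed pairs agree with the six statements above, for example age is lowest in Cluster-3 and highest in Cluster-5.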
Scatter Chart
The scatter chart shows the relationship between Index A and Index B: as the value of Index A increases, the value of Index B also increases.
To measure the strength of this relationship, we calculate the correlation coefficient between Index A and Index B.
The correlation coefficient between Index A and Index B is 0.8, which suggests a strong positive correlation between the two indices.
SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.804376 |
| R Square | 0.647021 |
| Adjusted R Square | 0.611724 |
| Standard Error | 139.5223 |
| Observations | 12 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 356826.9 | 356826.9 | 18.33033 | 0.001608 |
| Residual | 10 | 194664.7 | 19466.47 | | |
| Total | 11 | 551491.7 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 24021.81 | 1330.167 | 18.05924 | 5.81E-09 | 21058.01 | 26985.6 |
| Index A | 23.49787 | 5.48837 | 4.281394 | 0.001608 | 11.26902 | 35.72672 |

Estimated Model:
Index B = 24021.81 + 23.49787 × Index A
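The summary statistics of this simple regression follow from the ANOVA sums of squares, and the estimated model can be wrapped in a small function. The sketch below recomputes R Square, Multiple R, F and the standard error from the reported SS and df values; the function name `predict_index_b` is an illustrative choice, and small differences in the last digit come from rounding in the reported output.

```python
import math

# Values copied from the ANOVA table of the simple linear regression.
SS_REGRESSION = 356826.9
SS_RESIDUAL = 194664.7
DF_REGRESSION = 1
DF_RESIDUAL = 10

ss_total = SS_REGRESSION + SS_RESIDUAL
r_square = SS_REGRESSION / ss_total            # share of variation explained
multiple_r = math.sqrt(r_square)               # equals |correlation| with one predictor
ms_regression = SS_REGRESSION / DF_REGRESSION
ms_residual = SS_RESIDUAL / DF_RESIDUAL
f_stat = ms_regression / ms_residual           # overall F statistic
standard_error = math.sqrt(ms_residual)        # residual standard error


def predict_index_b(index_a):
    """Estimated model: Index B = 24021.81 + 23.49787 * Index A."""
    return 24021.81 + 23.49787 * index_a


print(f"R Square       = {r_square:.6f}")
print(f"Multiple R     = {multiple_r:.6f}")
print(f"F              = {f_stat:.5f}")
print(f"Standard Error = {standard_error:.4f}")
```

With a single predictor, Multiple R is the absolute value of the correlation coefficient, which is why it matches the correlation of about 0.8 between Index A and Index B.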
SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.976488 |
| R Square | 0.95353 |
| Adjusted R Square | 0.936103 |
| Standard Error | 56.59941 |
| Observations | 12 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 525863.7 | 175287.9 | 54.71773 | 1.13E-05 |
| Residual | 8 | 25627.95 | 3203.494 | | |
| Total | 11 | 551491.7 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 6195905 | 855511.5 | 7.24234 | 8.87E-05 | 4223092 | 8168718 |
| X | -76512.7 | 10619.48 | -7.20494 | 9.2E-05 | -101001 | -52024.1 |
| X2 | 316.1921 | 43.91855 | 7.199511 | 9.25E-05 | 214.9157 | 417.4685 |
| X3 | -0.43519 | 0.060515 | -7.19134 | 9.32E-05 | -0.57473 | -0.29564 |

The P-value in the ANOVA table is 1.13E-05 < 0.05, which suggests a significant relationship between the dependent variable (Index B) and the independent variables (Index A, (Index A)² and (Index A)³).
So this polynomial model is statistically significant. We can also observe that each coefficient is statistically significant, since every P-value is below 0.05.
The polynomial model is better than the simple regression model: it explains about 95% of the total variation in Index B, whereas the simple linear regression explains only 64.7%. The adjusted R Square, which penalizes the extra terms, also rises from 0.611724 to 0.936103.
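Because a model with more terms can never have a lower R Square, the fairer comparison between the two models uses the adjusted R Square. The sketch below applies the standard adjustment formula to the R Square values reported in the two outputs; the function name is an illustrative choice.

```python
def adjusted_r_square(r_square, n_obs, n_predictors):
    """Adjusted R Square = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R Square values and sample size (12 observations) from the two outputs.
linear_adj = adjusted_r_square(0.647021, n_obs=12, n_predictors=1)
cubic_adj = adjusted_r_square(0.95353, n_obs=12, n_predictors=3)

print(f"Simple linear model: adjusted R Square = {linear_adj:.6f}")
print(f"Cubic polynomial:    adjusted R Square = {cubic_adj:.6f}")
```

Both results agree with the Adjusted R Square rows of the outputs up to rounding, so the polynomial model remains better after accounting for its two extra terms.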
SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.951913 |
| R Square | 0.906138 |
| Adjusted R Square | 0.888538 |
| Standard Error | 4.734472 |
| Observations | 20 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 3462.306 | 1154.102 | 51.48742 | 1.93E-08 |
| Residual | 16 | 358.6436 | 22.41523 | | |
| Total | 19 | 3820.95 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -45.5723 | 9.81955 | -4.64098 | 0.000272 | -66.3888 | -24.7558 |
| Age | 0.736801 | 0.081895 | 8.996877 | 1.17E-07 | 0.563191 | 0.910411 |
| Weight (Kg) | 0.617927 | 0.10921 | 5.658165 | 3.56E-05 | 0.386412 | 0.849441 |
| Gender (Male) | 5.53566 | 2.391788 | 2.314445 | 0.034258 | 0.465297 | 10.60602 |

Regression formula as a mathematical equation:
Risk (%) = -45.5723 + 0.736801 × Age + 0.617927 × Weight (kg) + 5.53566 × Gender (Male)
Where Gender (Male) = 1 if the person is male and 0 otherwise.
Interpretation of Coefficients:
Age: Holding the other variables fixed, each additional year of age increases the predicted risk of diabetes by 0.736801 percentage points.
Weight: Holding the other variables fixed, each additional kg of weight increases the predicted risk of diabetes by 0.617927 percentage points.
Gender (Male): The predicted risk of diabetes for a male is 5.53566 percentage points higher than for a non-male person of the same age and weight.
Strength of the regression:
R² is 0.906138, so the model explains about 90.6% of the variation in risk, which suggests a very good fit.
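The fitted equation can be turned into a small prediction function. This is a minimal sketch using the coefficients reported above; the function name and the example person (55 years old, 70 kg) are illustrative assumptions, not data from the original exercise.

```python
def diabetes_risk_model1(age, weight_kg, is_male):
    """Predicted risk (%) from the three-variable regression."""
    gender_male = 1 if is_male else 0  # dummy variable: 1 for male, 0 otherwise
    return -45.5723 + 0.736801 * age + 0.617927 * weight_kg + 5.53566 * gender_male


# Hypothetical example: same age and weight, differing only in the gender dummy.
risk_male = diabetes_risk_model1(55, 70, is_male=True)
risk_other = diabetes_risk_model1(55, 70, is_male=False)

print(f"Male:     {risk_male:.2f}%")
print(f"Non-male: {risk_other:.2f}%")
print(f"Gap:      {risk_male - risk_other:.5f} percentage points")
```

The gap between the two predictions equals the Gender (Male) coefficient, which illustrates the interpretation given above.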
SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.954709 |
| R Square | 0.91147 |
| Adjusted R Square | 0.879852 |
| Standard Error | 4.915485 |
| Observations | 20 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 5 | 3482.682 | 696.5364 | 28.82778 | 6.72E-07 |
| Residual | 14 | 338.2679 | 24.16199 | | |
| Total | 19 | 3820.95 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -44.851 | 10.79272 | -4.15567 | 0.000971 | -67.9991 | -21.7029 |
| Age | 0.739998 | 0.085197 | 8.685702 | 5.19E-07 | 0.557268 | 0.922728 |
| Weight (Kg) | 0.586525 | 0.123644 | 4.743675 | 0.000314 | 0.321336 | 0.851714 |
| Gender (Male) | 4.99797 | 2.773922 | 1.80177 | 0.09315 | -0.9515 | 10.94744 |
| Life style (Small town) | 2.418534 | 2.999258 | 0.806377 | 0.43351 | -4.01423 | 8.851301 |
| Life style (Big city) | 2.476713 | 3.043866 | 0.813674 | 0.429454 | -4.05173 | 9.005155 |

Regression formula as a mathematical equation:
Risk (%) = -44.851 + 0.739998 × Age + 0.586525 × Weight (kg) + 4.99797 × Gender (Male) + 2.418534 × Life style (Small town) + 2.476713 × Life style (Big city)
Where Gender (Male) = 1 if the person is male and 0 otherwise.
Life style (Small town) = 1 if the person lives in a small town and 0 otherwise.
Life style (Big city) = 1 if the person lives in a big city and 0 otherwise.
Interpretation of Coefficients:
Age: Holding the other variables fixed, each additional year of age increases the predicted risk of diabetes by 0.739998 percentage points.
Weight: Holding the other variables fixed, each additional kg of weight increases the predicted risk of diabetes by 0.586525 percentage points.
Gender (Male): The predicted risk of diabetes for a male is 4.99797 percentage points higher than for an otherwise comparable non-male person.
Life style (Small town): For a person living in a small town, the predicted risk of diabetes is 2.418534 percentage points higher than for the baseline life style category.
Life style (Big city): For a person living in a big city, the predicted risk of diabetes is 2.476713 percentage points higher than for the baseline life style category.
Strength of the regression:
R² is 0.91147, so the model explains about 91.1% of the variation in risk, which suggests a very good fit. However, the adjusted R Square (0.879852) is lower than in the three-variable model (0.888538), and the P-values of Gender (Male) and of both life style indicators exceed 0.05, so the life style variables do not significantly improve the model.
Risk (%) = -44.851 + 0.739998 × Age + 0.586525 × Weight (kg) + 4.99797 × Gender (Male) + 2.418534 × Life style (Small town) + 2.476713 × Life style (Big city)
Risk (%) = -44.851 + 0.739998 × 55 + 0.586525 × 70 + 4.99797 × 1 + 2.418534 × 0 + 2.476713 × 1
Risk (%) = 44.38
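The substitution above can be reproduced with a short function for the five-variable model. This is a sketch using the reported coefficients; the function name and the dictionary keys (including the `"baseline"` label for the omitted life style category) are illustrative assumptions.

```python
# Life style effects relative to the omitted category; "baseline" is a
# hypothetical label for that omitted category, which the output does not name.
LIFESTYLE_EFFECT = {
    "baseline": 0.0,
    "small_town": 2.418534,
    "big_city": 2.476713,
}


def diabetes_risk_model2(age, weight_kg, is_male, lifestyle):
    """Predicted risk (%) from the regression with life style dummies."""
    gender_male = 1 if is_male else 0
    return (-44.851
            + 0.739998 * age
            + 0.586525 * weight_kg
            + 4.99797 * gender_male
            + LIFESTYLE_EFFECT[lifestyle])


# Inputs used in the worked calculation: age 55, weight 70 kg, male, big city.
risk = diabetes_risk_model2(55, 70, is_male=True, lifestyle="big_city")
print(f"Risk (%) = {risk:.2f}")
```

Using a lookup for the life style effect ensures that at most one of the two dummy variables is switched on, which mirrors how the dummies are defined.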
For first model (Given in 4-1-1):
For second model (Given in 4-2-1):
The analyst changed one of the input variables because the performance of model 1 (given in 4-1-1) is not good. The precision, sensitivity, specificity and F1-score of model 1 are shown in the following table.
| Indicator | Value |
|---|---|
| Precision | 0.434816 |
| Recall (Sensitivity) | 0.370104 |
| Specificity | 0.675044 |
| F1-Score | 0.399859 |
To improve the model performance, the analyst included further variables in model 1.
logit(p) = 11.13009787 + 279.210573 × Contract duration (month) + 1936.825499 × Bonus data (GB) + 7800.776932 × Usage (GB) + 18.87844045 × Plan-A + 25.5710999 × Plan-B + 28.76348699 × Plan-C
where p is the probability that the customer will buy the new service and
logit(p) = log(p / (1 - p))
Plan-A = 1 if the customer adopts plan A and 0 otherwise
Plan-B = 1 if the customer adopts plan B and 0 otherwise
Plan-C = 1 if the customer adopts plan C and 0 otherwise
Prediction for a customer with a contract duration of 16 months, bonus data of 63 GB and usage of 237 GB: will he/she decide to buy the new service or not?
To make this prediction, we first evaluate logit(p), where p is the probability that the customer buys the new service. We then obtain p from logit(p) using the transformation
p = 1 / (1 + exp(-logit(p)))
A value of p close to 1 suggests that the customer will buy the new service, whereas a value close to 0 suggests that the customer will not.
For the given information:
logit(p) = 11.13009787 + 279.210573 × Contract duration (month) + 1936.825499 × Bonus data (GB) + 7800.776932 × Usage (GB) + 18.87844045 × Plan-A + 25.5710999 × Plan-B + 28.76348699 × Plan-C
logit(p) = 11.13009787 + 279.210573 × 16 + 1936.825499 × 63 + 7800.776932 × 237 + 18.87844045 × 0 + 25.5710999 × 0 + 28.76348699 × 0
logit(p) = 1975282.638725
So,
p ≈ 1
This suggests that he/she will decide to buy the new service.
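The two steps of this prediction, evaluating the logit and then transforming it into a probability, can be sketched as follows. The coefficients are those reported above; the names `logit_score` and `sigmoid` are illustrative, and the sigmoid is written in a numerically stable form, which is a design choice of this sketch rather than part of the original spreadsheet calculation.

```python
import math

# Coefficients of the logistic model as reported in the text.
INTERCEPT = 11.13009787
COEF_CONTRACT_MONTHS = 279.210573
COEF_BONUS_GB = 1936.825499
COEF_USAGE_GB = 7800.776932
PLAN_COEFS = {"A": 18.87844045, "B": 25.5710999, "C": 28.76348699}


def logit_score(contract_months, bonus_gb, usage_gb, plan=None):
    """Linear predictor logit(p); plan=None leaves all plan dummies at 0."""
    z = (INTERCEPT
         + COEF_CONTRACT_MONTHS * contract_months
         + COEF_BONUS_GB * bonus_gb
         + COEF_USAGE_GB * usage_gb)
    if plan is not None:
        z += PLAN_COEFS[plan]
    return z


def sigmoid(z):
    """p = 1 / (1 + exp(-z)), arranged so that exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


z = logit_score(contract_months=16, bonus_gb=63, usage_gb=237)
p = sigmoid(z)
print(f"logit(p) = {z:.6f}")
print(f"p        = {p}")
```

The logit is so large that exp(-logit(p)) underflows to zero, so p evaluates to exactly 1.0 in floating point. The recomputed logit can differ from the reported value in the fourth decimal place because the printed coefficients are rounded.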
We create the following table from the 4-1-2 sheet:
| Actual Class | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| 0 | 1339 | 625 | 1964 |
| 1 | 852 | 509 | 1361 |
| Total | 2191 | 1134 | 3325 |
In class 1, the model misclassifies 62.6% of the cases (852 out of 1361), whereas in class 0 it misclassifies 31.8% of the cases (625 out of 1964).
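The per-class error rates quoted above are row-wise ratios of the classification table. A minimal sketch, assuming the counts from the table above are stored as a nested dictionary keyed by actual and then predicted class (the layout and the function name are illustrative):

```python
# counts[actual][predicted], copied from the classification table above.
counts = {
    0: {0: 1339, 1: 625},
    1: {0: 852, 1: 509},
}


def class_error_rates(table):
    """Share of each actual class that the model assigns to the wrong class."""
    rates = {}
    for actual, row in table.items():
        total = sum(row.values())
        wrong = sum(n for predicted, n in row.items() if predicted != actual)
        rates[actual] = wrong / total
    return rates


rates = class_error_rates(counts)
for actual in sorted(rates):
    print(f"Class {actual}: {rates[actual]:.1%} misclassified")
```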
The following table shows the confusion matrix:
| Actual Class | Predicted 1 | Predicted 0 |
|---|---|---|
| 1 | 567 | 965 |
| 0 | 737 | 1531 |
Out of 1532 customers who bought the new service, the model correctly classifies 567 as buyers, whereas 965 are misclassified.
Out of 2268 customers who did not buy the new service, 1531 are correctly classified, whereas 737 are misclassified.
Class 1 errors are the more serious problem in this model: 62.6% of Class 1 cases are misclassified, whereas only 31.8% of Class 0 cases are misclassified.
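The performance indicators reported for the first model can be recomputed from the four cells of this confusion matrix. The sketch below uses the standard definitions of precision, recall, specificity and F1-score, treating class 1 (buys the new service) as the positive class; the function name is an illustrative choice.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary classification indicators from confusion-matrix counts."""
    precision = tp / (tp + fp)        # predicted buyers who really bought
    recall = tp / (tp + fn)           # actual buyers the model found
    specificity = tn / (tn + fp)      # actual non-buyers the model found
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Counts from the confusion matrix above (positive class = 1).
metrics = classification_metrics(tp=567, fn=965, fp=737, tn=1531)
for name, value in metrics.items():
    print(f"{name:12s} {value:.6f}")
```

The results match the indicator values listed for model 1: precision 0.434816, recall 0.370104, specificity 0.675044 and F1-score 0.399859.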
Comparison of the accuracy of the second model (given in 4-2-1) with the first model (given in 4-1-1):
R² for the second model is 0.1159, whereas for the first model it is 0.08322. The error reports show that the first model misclassifies 1702 out of 3800 cases (44.8%), whereas the second model misclassifies 1169 out of 3325 cases (35.2%). The percentage of misclassification is therefore higher in the first model than in the second. The performance indicators of both models are given in the following table:
| Indicator | First Model | Second Model |
|---|---|---|
| Precision | 0.4348 | 0.5873 |
| Recall (Sensitivity) | 0.3701 | 0.4747 |
| Specificity | 0.6750 | 0.7688 |
| F1-Score | 0.3999 | 0.5250 |
All performance indicators show better results for the second model than for the first. The ROC curves allow us to compare the AUC values of the two models: the AUC of the second model is 0.7199 and that of the first model is 0.66878, which again shows that the second model is better.
Based on this discussion, the second model is preferred over the first model.
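The comparison can be summarised in a few lines of code. This sketch collects the figures quoted in this section for both models, computes the misclassification rates, and checks whether the second model is better on every indicator; the dictionary layout and variable names are illustrative.

```python
# Indicators quoted in the text for the two logistic models.
first_model = {"r_square": 0.08322, "precision": 0.4348, "recall": 0.3701,
               "specificity": 0.6750, "f1": 0.3999, "auc": 0.66878}
second_model = {"r_square": 0.1159, "precision": 0.5873, "recall": 0.4747,
                "specificity": 0.7688, "f1": 0.5250, "auc": 0.7199}

# Misclassified cases out of total cases, from the error reports.
misclass_first = 1702 / 3800
misclass_second = 1169 / 3325

second_wins_everywhere = all(second_model[k] > first_model[k] for k in first_model)

print(f"Misclassification, first model:  {misclass_first:.1%}")
print(f"Misclassification, second model: {misclass_second:.1%}")
print(f"Second model better on every indicator: {second_wins_everywhere}")
```

Note that the two models were evaluated on different numbers of cases (3800 and 3325), so the rates, not the raw counts, are the quantities to compare.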
Let S be the strike price of the share, fixed now, at which the option can be exercised after eight months, P the price of the share after eight months, and c the cost of the option. Then the profit per share is
Profit = S - P - c if S > P, and Profit = -c otherwise (the option is not exercised).
In the given example, S = $29, P = $26.4 and c = $1.5.
As S > P, Profit = S - P - c = $(29 - 26.4 - 1.5) = $1.1 per share.
We referred to Hull and Basu (2016).
S ($): Exercise price = 29
c ($): Cost of option = 1.5
| Price in eight months ($) | Profit ($) |
|---|---|
| 15 | 12.5 |
| 16 | 11.5 |
| 17 | 10.5 |
| 18 | 9.5 |
| 19 | 8.5 |
| 20 | 7.5 |
| 21 | 6.5 |
| 22 | 5.5 |
| 23 | 4.5 |
| 24 | 3.5 |
| 25 | 2.5 |
| 26 | 1.5 |
| 27 | 0.5 |
| 28 | -0.5 |
| 29 | -1.5 |
| 30 | -1.5 |
| 31 | -1.5 |
| 32 | -1.5 |
| 33 | -1.5 |
| 34 | -1.5 |
| 35 | -1.5 |
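The profit table can be regenerated from a single function. The sketch below assumes the payoff pattern seen in the table, where the option is exercised only when the share price is below the strike price and the loss is otherwise limited to the cost of the option; this matches a purchased put option, although that label is an inference from the figures. The function name is illustrative.

```python
def option_profit(strike, price, cost):
    """Profit per share: exercise only if price < strike, then subtract the cost."""
    return max(strike - price, 0.0) - cost


STRIKE = 29.0   # S ($), exercise price
COST = 1.5      # c ($), cost of the option

# Reproduce the table for share prices from $15 to $35.
profit_table = [(price, option_profit(STRIKE, price, COST)) for price in range(15, 36)]
for price, profit in profit_table:
    print(f"{price:>3d}  {profit:>5.1f}")

# The worked example from the text: P = $26.4.
example_profit = option_profit(STRIKE, 26.4, COST)
print(f"Profit at P = 26.4: {example_profit:.1f}")
```

Under these assumptions the break-even share price is S - c = $27.5: the profit is positive below that price and never falls below -c = -$1.5 above it.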
References:
Anderberg, M.R., 2014. Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks (Vol. 19). Academic press.
Draper, N.R. and Smith, H., 2014. Applied regression analysis (Vol. 326). John Wiley & Sons.
Hull, J.C. and Basu, S., 2016. Options, futures, and other derivatives. Pearson Education India.
Shmueli, G., Patel, N.R. and Bruce, P.C., 2016. Data mining for business analytics: Concepts, techniques, and applications with XLMiner. John Wiley & Sons.