1) Data can be described, among other things, in terms of central tendency and spread.
Mean
Median
Mode
Mean. The mean is the average of the values: the sum of all values in the data set divided by the number of values in it.
Median. When the data is arranged in ascending order, the median is the middle value, separating the lower half of the data set from the upper half.
Mode. The mode is the value that occurs most frequently in the data set.
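All three measures can be computed directly with Python's standard library (an illustrative sketch with made-up numbers):

```python
# Central tendency of a small data set, using Python's standard library.
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10]

print(mean(data))    # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
print(median(data))  # middle of the sorted values: (3 + 5) / 2 = 4
print(mode(data))    # 3 occurs twice, more often than any other value
```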
d) A boxplot can be used to visualize the distribution of an attribute. Explain how to interpret a boxplot.
A boxplot enables us to examine the distribution of a data set and to compare the level of the scores.
In the first step, the scores are sorted. Secondly, the sorted data is divided into four equal groups (25% of the scores in each subgroup). These subdivisions are called quartile groups, labeled 1 to 4 starting from the lowest.
If a boxplot is comparatively short, the data set being analyzed has little variation and is concentrated around the median (Thearling, 2017). In the case of students' exam performance, the students have scored grades within a narrow range.
If a boxplot is comparatively tall, the data set under analysis has great variation; for exam performance, some students may have scored high grades while others scored low grades, so the spread between them is very wide.
If one boxplot sits lower or higher than another, the two data sets differ in level. For instance, in an election analysis the boxplot for women may be lower or higher than the one for men, meaning the measured values for one group are generally higher than for the other; the reverse applies for the higher boxplot.
When the sections of a boxplot are unequal in size, opinions are similar in the narrower parts of the scale and more variable in the wider parts (Shmueli, Bruce, Yahav, Patel, & Lichtendahl, 2017). The whiskers also help with interpretation: a longer lower whisker means the bottom 25% of scores is more spread out, and the same reasoning applies to a longer upper whisker for the top 25%.
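The quartile groups and whisker limits described above can be computed before any plot is drawn; a minimal sketch with hypothetical exam scores, using Python's standard library:

```python
# Quartiles and whisker limits that underlie a boxplot (illustrative sketch;
# the scores are hypothetical and the 1.5*IQR rule is the common convention).
from statistics import quantiles

scores = [45, 50, 55, 60, 62, 65, 70, 75, 80, 95]

q1, q2, q3 = quantiles(scores, n=4)  # cut points of the four quartile groups
iqr = q3 - q1                        # interquartile range = height of the box
lower_limit = q1 - 1.5 * iqr         # scores outside these limits are
upper_limit = q3 + 1.5 * iqr         # usually drawn as outliers

print(q1, q2, q3)
```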
Consider the following training data whose goal is to determine whether a car is manual or automatic.
Input         | Hidden | Output
size = 1.0    | 0.469  | 0.6
model = 2.0   | 0.523  |
engine = 3.0  | 0.572  |
              | 0.617  |
In this example, manual is encoded as 1 while automatic is encoded as 0.
The output is 0.6, and because this value is closer to 1 than to 0, the neural network predicts the car is manual.
Input         | Hidden | Output
size = 1.0    | 0.469  | 0.4
model = 2.0   | 0.523  | 0.51
engine = 3.0  | 0.572  |
              | 0.617  |
In this example, manual is encoded as (1, 0) and automatic as (0, 1).
In the output, the larger of the two node values (0.51) is in the second position, which maps to (0, 1); hence the neural network predicts the car is automatic.
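Both decoding rules can be sketched in a few lines; the output values are taken from the tables above, and the 0.5 threshold for the single-output case is an assumption:

```python
# Decoding the two output encodings described above (illustrative sketch).

# Single-output encoding: manual = 1, automatic = 0; 0.5 threshold assumed.
output = 0.6
label = "manual" if output >= 0.5 else "automatic"

# Two-node encoding: manual = (1, 0), automatic = (0, 1);
# pick the class at the position of the largest output value.
outputs = (0.4, 0.51)
classes = ("manual", "automatic")
label2 = classes[outputs.index(max(outputs))]

print(label, label2)
```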
b) If we have a categorical but ordered input attribute, let’s say with the possible values {Low, Medium, High}, how would you code that? Why is this a good coding for that attribute?
I would encode Low as 1, Medium as 2, and High as 3.
This is a good coding for the attribute because it preserves the natural order of the values: High > Medium > Low, so the numeric distances between codes reflect the ranking.
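A minimal sketch of this ordinal encoding (the example values are made up):

```python
# Ordinal encoding for an ordered categorical attribute.
ORDER = {"Low": 1, "Medium": 2, "High": 3}  # preserves Low < Medium < High

values = ["Medium", "Low", "High", "High"]
encoded = [ORDER[v] for v in values]
print(encoded)  # [2, 1, 3, 3]
```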
A one-dimensional dataset with ten instances is given below:
{1, 1, 2, 3, 5, 8, 13, 21, 33, 54}
Assume that you have to explore a large dataset of high dimensionality and that you know nothing about the distribution of the data. Describe a method for finding the number of clusters in the dataset using k-means, and furthermore, explain how k-means can be applied to find the dimensions in which the clusters separate (i.e. how can you eliminate dimensions that don’t provide any useful information for clustering the dataset, using k-means).
The gap statistic method can be used to determine the number of clusters in the dataset. The method uses the output of the k-means algorithm for a range of candidate k values and compares the within-cluster dispersion W_k against the dispersion expected under a null reference distribution of the data: Gap(k) = E*[log(W_k)] − log(W_k), where E* denotes the expectation under the reference distribution. According to Thearling (2017), this method gives more precise results than the other methods.
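A sketch of the first half of this procedure on the one-dimensional data set above: run k-means for several values of k and record the within-cluster dispersion W_k (the gap statistic then compares log(W_k) against the same quantity computed on null reference data). The simple deterministic initialization is an assumption for illustration:

```python
# k-means on the 1-D data set, tracking within-cluster dispersion W_k.
data = [1, 1, 2, 3, 5, 8, 13, 21, 33, 54]

def kmeans_1d(points, k, iters=20):
    pts = sorted(points)
    # Deterministic initialization: centers spread across the sorted data.
    centers = [pts[i * (len(pts) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def dispersion(centers, clusters):
    # W_k: total squared distance of every point to its cluster center.
    return sum((x - m) ** 2 for m, c in zip(centers, clusters) for x in c)

for k in (1, 2, 3, 4):
    centers, clusters = kmeans_1d(data, k)
    print(k, round(dispersion(centers, clusters), 1))  # W_k shrinks as k grows
```

The gap statistic looks for the k at which W_k drops much faster on the real data than on the reference data, rather than simply taking the largest k.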
1) You work at an e-retailer selling primarily clothes. Now you would like to use data mining, more specifically predictive modeling, to select which customers to target with a promotion for a new line of luxury dresses.
I would break down the input variables into three major divisions:
(1) All data related to the new line of luxury dresses: prices, quality, sizes, designs, and colors.
(2) All variables related to the different media houses my company will use for the planned advertising and promotion of the luxury dresses.
(3) All variables related to the budget allocated to the advertising and promotion activities, covering both external advertising costs and internal advertising investment.
As the output variable I would use the customer's response, i.e., whether the customer buys from the new line of luxury dresses after being targeted by the promotion.
4, b) Give the main advantage (one per technique) of using the following two modeling techniques for the task presented above: i) random forest and ii) decision trees. Are these properties contradictory, i.e., must we choose one of them, or can we (at least to some degree) have both?
The main advantage of a random forest is its ability to limit overfitting without increasing error due to bias (Shmueli et al., 2017). A random forest draws on many random subsets of the features and the data, building many decision trees instead of just one, which is more effective.
The main advantage of a decision tree is that it is easy and fast to interpret, making visualization quicker and easier (Roiger, 2017).
These properties are not strictly contradictory: a random forest is a combination of decision trees, so we can have both to some degree, depending on the data. When a single decision tree grows very deep, overfitting can occur, and a random forest eliminates this shortcoming at the cost of some interpretability.
3265+2=3267
precision = TP / (TP + FP) = 6 / (6 + 7) = 6/13 ≈ 0.462
recall = TP / (TP + FN) = 6 / (6 + 48 + 8) = 6/62 ≈ 0.0968
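The two calculations above can be recomputed in a couple of lines (the counts TP = 6, FP = 7, FN = 48 + 8 are taken from the working shown):

```python
# Precision and recall from the confusion-matrix counts in the text.
tp, fp, fn = 6, 7, 48 + 8

precision = tp / (tp + fp)  # 6/13
recall = tp / (tp + fn)     # 6/62

print(round(precision, 3), round(recall, 4))  # 0.462 0.0968
```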
This model provides lower accuracy than the previous model. However, it may be preferred when there is a large class imbalance, because it will then have better predictive power on the minority class.
It represents generalization of data. In machine learning, generalization refers to how accurately the concepts a model learned from its training data apply to new, unseen data. A well-generalized model should accurately represent the training data and should be flexible enough to accommodate new data efficiently (KS & Kamath, 2017).
It represents underfitting. According to Lu, Setiono, & Liu (2017), an underfitted model cannot appropriately represent the training data, and consequently cannot generalize to new data either. To solve the problem of underfitting, we can fit the target variable as an nth-degree polynomial. As we increase the polynomial degree, the training error tends to decrease; the cross-validation error also decreases at first, tracing a convex curve whose minimum identifies a degree more accurate than the underfitted one.
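The effect of raising the degree on training error can be sketched with hypothetical data: a degree-0 fit (the mean) is compared with a degree-1 least-squares line on points generated from y = x², and the higher-degree fit has the lower training error.

```python
# Training error shrinks as the polynomial degree grows (illustrative
# sketch; the data points are made up, generated from y = x^2).
xs = [0, 1, 2, 3, 4, 5]
ys = [x * x for x in xs]

# Degree 0: predict the mean of y for every x.
y_bar = sum(ys) / len(ys)
sse0 = sum((y - y_bar) ** 2 for y in ys)

# Degree 1: closed-form least-squares line y = a + b*x.
x_bar = sum(xs) / len(xs)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
sse1 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

print(sse0 > sse1)  # True: the degree-1 fit has lower training error
```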
It represents overfitting. Overfitting occurs when a model represents the training data so closely that it generalizes poorly when new data is added (Ashraf, Ahmad, & Ashraf, 2018). To minimize it, the flexibility of the model should be reduced, using approaches such as regularization, pruning, or validation against held-out data.
References
Ashraf, N., Ahmad, W., & Ashraf, R. (2018). A comparative study of data mining algorithms for high detection rate in intrusion detection system.
KS, D., & Kamath, A. (2017). Survey on Techniques of Data Mining and its Applications.
Lu, H., Setiono, R., & Liu, H. (2017). Neurorule: A connectionist approach to data mining. arXiv preprint arXiv:1701.01358.
Roiger, R. J. (2017). Data mining: a tutorial-based primer. Chapman and Hall/CRC.
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: concepts, techniques, and applications in R. John Wiley & Sons.
Thearling, K. (2017). An introduction to data mining.