Data Mining Project: Boston Housing Dataset Analysis

Task 1: Understand the dataset

The main objective of this project is to apply data mining techniques to the Boston housing dataset in order to resolve a business problem. The provided dataset is analysed with the Weka data mining tool to arrive at suitable business solutions. The analysis also reviews current methodologies and algorithms for business analytics, which are discussed and analysed in detail.

To analyse the provided dataset, the user first needs to understand it. The Boston housing dataset is described below (Ahmadi & E Shiri Ahmad Abadi, 2013).

The provided dataset has the following attributes:

Id: Identifies the data instances
MS Sub Class: Identifies the type of dwelling
MS Zoning: Identifies the zoning classification of the sale
Lot Frontage: Linear feet of street connected to property
Lot Area: Lot size in square feet
Street: Type of road access to property
Alley: Type of alley access to property
Lot Shape: General shape of property
Land Contour: Flatness of the property
Utilities: Type of utilities available
Lot Config: Lot configuration
BsmtHalfBath
FullBath
HalfBath
Bedroom
Kitchen
Kitchen Qual
Land Slope: Slope of property
Neighbourhood: Physical locations within Ames city limits
Condition 1: Proximity to various conditions
Year Built: Original construction date
Year Remod Add: Remodel date
Tot Rms Abv Grd
Condition 2: Proximity to various conditions
Bldg Type: Type of dwelling
House Style: Dwelling Style
Sale Type: Type of sale
Sale Condition: Condition of sale
Overall Qual: Rates the overall material and finish of the house
Overall Cond: Rates the overall condition of the house
Sale Price: Sale amount

The dataset contains further attributes beyond those listed here.

Summary statistics for the provided dataset are shown below.

For the Id attribute:

For the Sale Condition attribute (Arabnia, Stahlbock, Abou-Nasr & Weiss, n.d.):

A visualization of the provided dataset is shown below.

Task 2: Relationships discovery among features

In this task, the user needs to discover the relationships that exist among the attributes. Normalization is applied to the Boston Housing data for this purpose: it rescales the numeric attributes to a common range so that attributes measured in different units can be compared directly. Removing duplicate instances is a separate preprocessing step (Azzalini & Scarpa, 2012).

In this task, the user is also required to list the potential business analyses for the provided dataset. Classification and prediction algorithms are used to address the business problem and to propose effective solutions. The results offer the following benefits to a real estate consulting firm:

Business benefits
Improve the business process
Support decision making
Support strategy development

ZeroR is the simplest classification method: it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority class (Witten, Frank & Hall, 2011). Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods.

Algorithm: construct a frequency table for the target and select its most frequent value.

Predictor contribution: nothing can be said about the contribution of the predictors to the model, because ZeroR does not use any of them.

Model evaluation: ZeroR predicts only the majority class correctly. As mentioned above, it is useful mainly for establishing a baseline performance for other classification methods.

The ZeroR output for the Sale Price target is shown below (Han, Kamber & Pei, 2012).

=== Classifier model (full training set) ===

ZeroR predicts class value: 180921.19589041095

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient                 -0.0508
Mean absolute error                  57444.7035
Root mean squared error              79439.3263
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances             1460

For the numeric Sale Price target, ZeroR predicts the mean value, 180921.19589041095, for every instance, which gives an RMSE of 79439.3263. Any other regression model must achieve an RMSE lower than this baseline to be considered useful. For the nominal Sale Condition target, ZeroR predicts the value Normal for all instances because it is the majority class, and achieves an accuracy of about 82% (Kaluža, 2013).
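The regression baseline described above can be sketched in a few lines. This is a minimal illustration, not Weka's implementation: the function names are hypothetical, the four prices are made-up values rather than rows from the dataset, and the errors are computed on the same data instead of by cross-validation.

```python
import math

def zero_r_regression(train_targets):
    """ZeroR for a numeric target: always predict the mean of the training targets."""
    return sum(train_targets) / len(train_targets)

def regression_errors(actual, predicted_value):
    """Mean absolute error and root mean squared error of a constant prediction."""
    n = len(actual)
    mae = sum(abs(a - predicted_value) for a in actual) / n
    rmse = math.sqrt(sum((a - predicted_value) ** 2 for a in actual) / n)
    return mae, rmse

# Hypothetical sale prices, for illustration only (not taken from the dataset).
prices = [100000.0, 150000.0, 200000.0, 250000.0]

baseline = zero_r_regression(prices)
mae, rmse = regression_errors(prices, baseline)
print(baseline, mae, rmse)
```

Weka reports the relative absolute error and the root relative squared error against this kind of mean predictor, which is why ZeroR itself scores 100% on both. A model is only worth considering if its relative errors fall below 100%.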

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1198               82.0548 %
Incorrectly Classified Instances       262               17.9452 %
Kappa statistic                          0
Mean absolute error                      0.1056
Root mean squared error                  0.2289
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances             1460

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC  ROC Area  PRC Area  Class
                 1.000    1.000    0.821      1.000   0.901      ?    0.496     0.819     Normal
                 0.000    0.000    ?          0.000   ?          ?    0.495     0.069     Abnorml
                 0.000    0.000    ?          0.000   ?          ?    0.489     0.084     Partial
                 0.000    0.000    ?          0.000   ?          ?    0.199     0.003     AdjLand
                 0.000    0.000    ?          0.000   ?          ?    0.433     0.007     Alloca
                 0.000    0.000    ?          0.000   ?          ?    0.500     0.014     Family
Weighted Avg.    0.821    0.821    ?          0.821   ?          ?    0.494     0.685

=== Confusion Matrix ===

    a    b    c    d    e    f   <-- classified as
 1198    0    0    0    0    0 |    a = Normal
  101    0    0    0    0    0 |    b = Abnorml
  125    0    0    0    0    0 |    c = Partial
    4    0    0    0    0    0 |    d = AdjLand
   12    0    0    0    0    0 |    e = Alloca
   20    0    0    0    0    0 |    f = Family
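The accuracy and kappa values in this output follow directly from the class counts. The sketch below is an illustration rather than Weka's code: the function names are hypothetical, the counts are copied from the confusion matrix, and the model is evaluated on the full data rather than by cross-validation, an assumption made for brevity.

```python
from collections import Counter

def zero_r_fit(labels):
    """Build the frequency table of the target and return its most frequent class."""
    counts = Counter(labels)
    majority, _ = counts.most_common(1)[0]
    return majority, counts

def accuracy_and_kappa(matrix):
    """Observed accuracy and Cohen's kappa from a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, given the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Class counts copied from the confusion matrix in the Weka output.
class_counts = {"Normal": 1198, "Abnorml": 101, "Partial": 125,
                "AdjLand": 4, "Alloca": 12, "Family": 20}
labels = [name for name, count in class_counts.items() for _ in range(count)]

majority, counts = zero_r_fit(labels)

# ZeroR predicts the majority class for every instance, so only the
# majority column of the confusion matrix is non-zero.
classes = list(class_counts)
matrix = [[class_counts[actual] if predicted == majority else 0
           for predicted in classes] for actual in classes]

accuracy, kappa = accuracy_and_kappa(matrix)
print(majority, round(accuracy * 100, 4), round(kappa, 4))
```

Because every prediction falls in one column, the agreement expected by chance equals the observed agreement, so kappa is zero even though the accuracy looks high. This is the sense in which 82% is a baseline and not evidence of a good model.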

From the tables above, ZeroR correctly classifies 1198 of the 1460 instances in the Boston Housing data (82.05%) and misclassifies the remaining 262 (17.95%). The other algorithms yield an average accuracy of around 85%, and the highest accuracy belongs to the MultiScheme classifier. ZeroR sits at the bottom of the comparison, with relative absolute and root relative squared errors of 100% (Maimon & Rokach, 2010).

The total time required to build the model is also an important parameter when comparing classification algorithms. It is likewise usual to assess the reliability and validity of the data gathered. This analysis relies on commonly used indicators, namely the mean absolute error and the root mean squared error, together with their relative counterparts. The largest error is found for the ZeroR classifier: its weighted-average score of about 0.821 only reflects the share of the majority class, and its kappa statistic is 0. An algorithm with a lower error rate is preferred because it has more powerful classification capability. After investigation, we can therefore say that ZeroR is not an appropriate model for this data, since it makes the largest number of errors and cannot separate the classes effectively; its value lies in serving as a baseline (Olson, 2017).

References

Ahmadi, F., & E Shiri Ahmad Abadi, M. (2013). Data Mining in Teacher Evaluation System using WEKA. International Journal of Computer Applications, 63(10), 12-18. doi: 10.5120/10501-5268

Arabnia, H., Stahlbock, R., Abou-Nasr, M., & Weiss, G. (n.d.). DMIN 2017.

Azzalini, A., & Scarpa, B. (2012). Data Analysis and Data Mining. Oxford: Oxford University Press, USA.

Han, J., Kamber, M., & Pei, J. (2012). Data mining. Waltham, MA: Morgan Kaufmann/Elsevier.

Kaluža, B. (2013). Instant Weka how-to. Birmingham: Packt Pub.

Maimon, O., & Rokach, L. (2010). Data mining and knowledge discovery handbook. New York: Springer.

Olson, D. (2017). Descriptive Data Mining. Singapore: Springer Singapore.

Witten, I., Frank, E., & Hall, M. (2011). Data mining. Burlington, Mass.: Morgan Kaufmann Publishers.
