Car Evaluation Using Machine Learning

Abstract.

Cars are an essential part of day-to-day life. Different manufacturers produce many kinds of cars, so the buyer has a decision to make.

When an individual considers buying a car, numerous aspects can influence the choice of which kind of car to buy. The choice generally depends on the price, the safety, and how luxurious and how spacious the car is.

The Car Evaluation database is structured information about car features that is useful in decision making. The dataset is labeled according to the specifications of PRICE, COMFORT and SAFETY. The dataset used in this assignment can be accessed at https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

The objective of this report is to support that decision: to identify how car variables such as the buying price, together with the other variables, separate good and acceptable cars from unacceptable ones in the target variable.

2. INTRODUCTION.

Understanding how to choose a car is important for everybody, particularly first-time buyers and anyone inexperienced in how the car business works. We generally need a car as a means of transportation, but once enjoyment enters the decision we tend to overlook practical factors that should not be underestimated.

Telling a good car from an average or a terrible one is normally done manually, with the help of a car sales representative who guides the purchase, or from the opinions of family and friends who have had past trouble with their vehicles. It would be better to have a tool that can check a car's features and tell whether it is an X car or a Y car. With such a tool, purchasing a car would involve far less worry.

At present it is usually the car sales representative who encourages us to purchase a car or not. Whether or not we are aware of it, we are essentially ignoring the factors that would serve us in terms of cost, comfort and safety in the long run.

In this assignment we process the data, explore the relationships between the attributes, and model the data with two classification models, K-nearest neighbors and decision trees. The models are compared in terms of their best parameter settings and their performance on the Car Evaluation dataset.

2.1 Dataset Attributes

The dataset was accessed from the UCI repository. It is a collection of observations of the specified attributes of a car and was donated by Marko Bohanec in 1997.

The Car Evaluation dataset contains the following concept structure:

CAR: car acceptability
    PRICE: overall price
        buying: buying price
        maint: price of the maintenance
    TECH: technical characteristics
        COMFORT: comfort
            doors: number of doors
            persons: capacity in terms of persons to carry
            lug_boot: the size of the luggage boot
        safety: estimated safety of the car

2.1. A. Do we need all the Variables?

Getting rid of unnecessary variables is a good first step when working with any dataset, since dropping attributes reduces complexity and can make computation on the dataset faster. Whether an attribute should be dropped depends on the size of the dataset and the goal of the investigation, but in any case it is useful to drop variables that only distract from the aim of the assignment.

Car acceptability depends directly on six input attributes: 'buying', 'maint', 'doors', 'persons', 'lug_boot' and 'safety'. The seventh column, 'classes', is the target.

The dataset contains 1727 instances, and the possible values of each attribute are listed below.

INPUT Attributes

buying: vhigh, high, med, low
maint: vhigh, high, med, low
doors: 2, 3, 4, 5more
persons: 2, 4, more
lug_boot: small, med, big
safety: low, med, high

Table 1: Attribute values

Missing attribute values: none. The Car Evaluation dataset does not contain any missing values.

2.1. B. TARGET ATTRIBUTE- classes

Data analysis is done on this dataset to identify patterns and the range of each attribute, together with the frequency (percentage) of each class.

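classes   Number of observations per class   Percentage
unacc     1209                               70.023%
acc       384                                22.222%
good      69                                 3.993%
vgood     65                                 3.762%

Table 2: Class Distribution

The percentages are those published with the dataset for its full 1728 instances (1210 of them unacc), which is why they differ marginally from the shares of the 1727 rows counted here.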

The target variable classes indicates whether each car is unacc, acc, good or vgood. Predicting this class is the goal of the analysis.

Attribute Characteristics: Categorical

Associated Tasks: Classification, to acquire knowledge from the dataset

The table shows that most instances fall in the unacc class, so the class distribution is heavily imbalanced. Because the target is categorical, classification is the right task for this data, and the imbalance has to be kept in mind when the models are evaluated.
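The class shares can be recomputed from the class counts given in this report for the 1727 loaded rows (1209, 384, 69 and 65). The following is a minimal sketch; the variable names are illustrative and are not taken from the original analysis.

```python
# Class counts as given in this report for the 1727 loaded rows.
class_counts = {"unacc": 1209, "acc": 384, "good": 69, "vgood": 65}

total = sum(class_counts.values())

# Share of each class as a percentage of all rows.
class_share = {label: 100 * count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"{label:<6} {class_counts[label]:>5} {share:6.2f}%")

# The majority class is the one with the largest count.
majority = max(class_counts, key=class_counts.get)
print("majority class:", majority)
```

For these counts the unacc share comes out at roughly 70%, which is the imbalance discussed above.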

3. Methodology

3.1 Data collection

The Car Evaluation dataset is selected from the UCI Machine Learning Repository for this assignment. The dataset contains 1727 instances and 6 input attributes. We import the necessary pandas modules to read the Car Evaluation dataset from our system drive.
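Reading the data with pandas might look like the sketch below. It assumes the raw file is comma-separated and has no header row, and it uses the column names that appear in this report. The three rows are illustrative stand-ins in the same layout, not a copy of the dataset, and the actual loading code of the assignment is not shown in the report.

```python
import io

import pandas as pd

# Column names as used in this report; the raw file is assumed to have no header row.
COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "classes"]

# Three illustrative rows standing in for the real file.
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,5more,more,big,high,vgood\n"
    "med,low,4,4,med,med,acc\n"
)

# header=None keeps the first record as data instead of using it as column names.
# For the real data, pass the path of the downloaded file instead of `sample`.
df = pd.read_csv(sample, header=None, names=COLUMNS)

print(df.shape)
print(df.dtypes)
```

If a headerless file is read without `header=None`, pandas uses the first record as column names and the frame ends up one row short, so the row count is worth checking right after loading.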

3.2 Data preprocessing

The dataset from the UCI repository has already been cleaned and is of standard quality before the model analysis proceeds. Datasets often contain missing values and extreme values called outliers; these can affect our tests and sometimes even cause a model to fail. It is better to remove outliers and fill missing values with nearby values. In our dataset there are no missing values or outliers of any sort.

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
classes     0
dtype: int64
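A per-column count like the one above is what pandas returns from `isnull().sum()`. The sketch below shows the call on a tiny made-up frame that uses the report's column names; it is an assumption that the original output was produced this way.

```python
import pandas as pd

# Tiny illustrative frame with the report's column names (values are made up).
df = pd.DataFrame(
    {
        "buying": ["vhigh", "low"],
        "maint": ["vhigh", "low"],
        "doors": ["2", "5more"],
        "persons": ["2", "more"],
        "lug_boot": ["small", "big"],
        "safety": ["low", "high"],
        "classes": ["unacc", "vgood"],
    }
)

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```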

The output above shows the number of missing values per column. Detecting missing values is easy; deciding how to handle them is harder. Missing values in categorical data are less troubling because they can be treated as a separate NA category, whereas missing values in numerical variables are more troublesome for the analysis. Data cleaning is done before we start.

Before starting the analysis, it is a good idea to check the dimensions of the dataset by looking at its shape and at the description of the variables.

Exploring the attributes and variables

The initial step in exploratory data analysis is reading the data and then exploring the attributes. It is essential to get a sense of how many variables and cases there are, the datatypes of the attributes, and the range of values each attribute can take.

Transforming the variables (data transformation)

When the dataset is first loaded, some variables may be encoded as datatypes that do not suit the analysis. For example, the classes variable (the target) holds the labels unacceptable, acceptable, good and very good, which are mapped to the values 1, 2, 3 and 4.

Most of the variables are loaded as object type. In this analysis all variables are categorical and stored as strings. The models require integer inputs, so we convert each string to an integer by assigning a specified number to each value (encoding).
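The encoding step can be sketched with plain dictionaries. Only the idea of mapping each string to an integer, and the 1 to 4 coding of the target, come from the report; the specific codes chosen for the input attributes below are an assumption made for illustration.

```python
# Hypothetical ordinal codes: the numbers assigned to the input attributes
# are an assumption, only the 1..4 coding of `classes` follows the report.
ORDINAL_CODES = {
    "buying": {"low": 1, "med": 2, "high": 3, "vhigh": 4},
    "maint": {"low": 1, "med": 2, "high": 3, "vhigh": 4},
    "doors": {"2": 1, "3": 2, "4": 3, "5more": 4},
    "persons": {"2": 1, "4": 2, "more": 3},
    "lug_boot": {"small": 1, "med": 2, "big": 3},
    "safety": {"low": 1, "med": 2, "high": 3},
    "classes": {"unacc": 1, "acc": 2, "good": 3, "vgood": 4},
}


def encode_row(row):
    """Replace every categorical string in a row with its integer code."""
    return {column: ORDINAL_CODES[column][value] for column, value in row.items()}


example = {
    "buying": "vhigh", "maint": "med", "doors": "5more", "persons": "4",
    "lug_boot": "small", "safety": "high", "classes": "acc",
}
encoded = encode_row(example)
print(encoded)
```

An ordered coding keeps the natural ranking of values such as low, med and high, which matters for a distance-based model like K-nearest neighbors.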

3.3 Data exploration

Data exploration is a technique, similar to data analysis, in which the data and its characteristics are summarized visually.

Data exploration includes the following:

i. Univariate: exploration and analysis of each variable
ii. Bivariate: exploration and analysis of pairs of variables and their relationship
iii. Multivariate: exploration of multiple variables in the dataset

Here we check each feature against the class distribution.

The graph above gives the count of each unique value in the column against the classes.

From the graph, almost 70% of cars are in the unacceptable (unacc) class, which means the class distribution is strongly imbalanced.

Out of the total of 1727 car instances in the dataset, 1209 (70%) were unacceptable, 384 (22%) were acceptable, 69 (3.9%) were good and 65 (3.7%) were very good. From the graph we can conclude that more than half of the cars evaluated were not acceptable.

The buying histogram shows that the distribution of the classes tends to be uniform, while a very high or high buying cost will probably make a car unaccepted.

A very high or high maintenance price will probably make a car unaccepted.

Here the distribution of the classes tends to be uniform, whereas having 2 doors pushes a car toward the unaccepted class.

In the persons distribution of the classes, a car with a capacity of 2 persons will be unaccepted.

For luggage boot space, a small luggage boot pushes a car toward the unaccepted class.

In the safety distribution the classes look approximately normally distributed, and low safety will most likely cause a car to be unaccepted.

Because a normal-looking distribution is what is usually seen in real-world measurements, we treat safety as the most important feature in our model analysis.

3.4 Splitting and randomizing the dataset

Training and Testing

X: data frame containing the input data

Y: output data, the result that has to be predicted

In this assignment we divided the dataset into a training set and a testing set. The 3 splits used are shown in Table ab.

Training %   Testing %
50%          50%
60%          40%
80%          20%

Table ab: Training and Testing Split
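A randomized split for the three ratios in Table ab can be sketched as follows. The helper `split_indices`, the seed, and the choice to round the test size up are assumptions for illustration; the report does not show the splitting code that was actually used.

```python
import math
import random


def split_indices(n_rows, test_fraction, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)  # assumption: round the test size up
    return indices[n_test:], indices[:n_test]


N_ROWS = 1727  # number of instances given in this report

test_sizes = {}
for test_fraction in (0.5, 0.4, 0.2):
    train_idx, test_idx = split_indices(N_ROWS, test_fraction)
    test_sizes[test_fraction] = len(test_idx)
    print(f"test fraction {test_fraction}: {len(train_idx)} train, {len(test_idx)} test")
```

With 1727 rows this gives test sets of 864, 691 and 346 rows, which equal the totals of the confusion matrices given in the result tables of this report.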

3.5 Data modeling (classification)

The experiment is carried out using two classifier models, K-nearest neighbors and decision trees. Its purpose is to determine which classifier best suits our dataset, both in classifying the training and testing sets and in making predictions with the model obtained during training. The detailed procedure of the experiment is given below.

K-nearest neighbors

K-NN is a classifier that finds the classes of the k nearest neighbors of a test sample, based on a distance metric (the Minkowski distance, which is the Manhattan distance for p = 1 and the Euclidean distance for p = 2), and then assigns the class held by the majority of those neighbors to the test sample. Here we compare the 3 splits of the dataset.

K-NN is a supervised learning technique in which each new instance is compared against many labeled instances. The labels are specified in advance to train our model.

The parameters tuned are n_neighbors and the power parameter p, with p = 1 or 2.
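The voting rule described above can be sketched in a few lines of plain Python. This is an illustrative from-scratch version, not the library implementation used for the reported results; the function names and the toy rows are made up, and the parameter names `n_neighbors` and `p` simply mirror the ones named in the text.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, n_neighbors=3, p=2):
    """Return the majority class among the n_neighbors closest training rows."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: minkowski(pair[0], query, p))
    nearest_labels = [label for _, label in ranked[:n_neighbors]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy encoded rows (two integer-coded attributes); values and labels are made up.
train_X = [(1, 1), (1, 2), (2, 1), (3, 3), (3, 2), (2, 3)]
train_y = ["unacc", "unacc", "unacc", "acc", "acc", "acc"]

prediction = knn_predict(train_X, train_y, query=(3, 3), n_neighbors=3, p=2)
print(prediction)
```

Changing `p` changes how distances are measured, and changing `n_neighbors` changes how many votes are counted, which is why both are tuned together.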

Decision trees

A decision tree is a model that uses a tree-like graph of decisions and their possible consequences. It is one way to display an algorithm that contains only conditional control statements.

It follows a flowchart-like structure in which each internal node is a condition on an attribute, each branch represents an outcome of the condition, and each leaf node represents a class label. The path from the root to a leaf represents a classification rule.

Root node: represents the total population (all instances) and breaks down into branches and sub-nodes based on the conditions.

Decision node: a sub-node that is divided into further sub-nodes.

Leaf node: a node that cannot be split further into sub-nodes.
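The three node types can be made concrete with a small hand-written tree. The rules below are illustrative only and were not learned from the data; the nested-dictionary layout and the `classify` helper are assumptions chosen to keep the sketch short.

```python
# A hand-written toy tree. Decision nodes test one attribute,
# leaf nodes hold a class label. The rules are illustrative, not learned.
toy_tree = {
    "attribute": "safety",              # root node
    "branches": {
        "low": {"label": "unacc"},      # leaf node
        "med": {"label": "acc"},        # leaf node
        "high": {
            "attribute": "persons",     # decision node
            "branches": {
                "2": {"label": "unacc"},
                "4": {"label": "acc"},
                "more": {"label": "acc"},
            },
        },
    },
}


def classify(node, car):
    """Follow the branches from the root until a leaf is reached."""
    while "label" not in node:
        node = node["branches"][car[node["attribute"]]]
    return node["label"]


print(classify(toy_tree, {"safety": "high", "persons": "2"}))
```

Each path from the root to a leaf reads as one classification rule, for example "safety is high and persons is 2, so the car is unacc" in this toy tree.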

3.6 Accuracy Test

Accuracy: the number of correct classifications divided by the total number of classifications.

Train accuracy: the accuracy of a model on the samples it was constructed on.

Test accuracy: the accuracy of a model on samples it has not seen.

The accuracy is tested on each split of the dataset from Table ab, and a report is printed to compare which model suits best.
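Accuracy can be read directly off a confusion matrix: the diagonal holds the correct classifications. The sketch below applies this to the two 80-20 confusion matrices given in the results of this report; the function name is illustrative.

```python
def accuracy(confusion):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# 80-20 test-set confusion matrices as given in this report.
knn_confusion = [[231, 2, 1, 0], [14, 73, 0, 0], [1, 1, 10, 0], [0, 2, 1, 10]]
tree_confusion = [[224, 9, 1, 0], [1, 78, 7, 1], [0, 0, 11, 1], [0, 2, 0, 11]]

knn_accuracy = accuracy(knn_confusion)
tree_accuracy = accuracy(tree_confusion)
print(f"KNN: {knn_accuracy:.4%}, decision tree: {tree_accuracy:.4%}")
print(f"KNN error rate: {1 - knn_accuracy:.4%}")
```

Both matrices have 324 correct predictions out of 346, so the two models end up with the same test accuracy even though their errors fall in different cells.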

Result

The results are presented for the following model analysis.

a. Classification

KNN

Confusion matrix

Below is the confusion matrix of our KNN model for the 80-20 split of the dataset.

Confusion Matrix:

[[231   2   1   0]
 [ 14  73   0   0]
 [  1   1  10   0]
 [  0   2   1  10]]

KNN fits the test set well: only 22 of the 346 test instances are misclassified.

To understand the model better we also used further measurements: precision, recall and F1 score.

The measurements for the KNN model with the 80:20 split are:

precision: 0.9270637898686679

recall: 0.8572060123784262

f1 score: 0.8875617588932807

The F1 score combines precision and recall (it is their harmonic mean), so we use it to measure model performance.
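The reported precision, recall and F1 values are reproduced by macro-averaging the per-class scores of the 80-20 KNN confusion matrix, as the sketch below shows. It assumes that rows are true classes and columns are predicted classes; the function name is illustrative.

```python
def macro_scores(confusion):
    """Macro-averaged precision, recall and F1 (rows = true, columns = predicted)."""
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        true_pos = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column total
        actual_k = sum(confusion[k])                          # row total
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# 80-20 KNN confusion matrix as given in this report.
knn_confusion = [[231, 2, 1, 0], [14, 73, 0, 0], [1, 1, 10, 0], [0, 2, 1, 10]]

precision, recall, f1 = macro_scores(knn_confusion)
print(precision, recall, f1)
```

Macro-averaging gives every class the same weight, so the small good and vgood classes influence the score as much as the large unacc class.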

In this analysis the 80:20 split of the KNN model performs well.

The accuracy achieved is presented in Table c and Table D.

KNN

Splitting Percentage (Training % and Testing %): 80-20
n_neighbors: 7
Power variable (p): 2
Testing accuracy: 93.64161849710982%
Classification error rate: 6.358381502890175%

Confusion Matrix:

[[231   2   1   0]
 [ 14  73   0   0]
 [  1   1  10   0]
 [  0   2   1  10]]

precision: 0.9270637898686679
recall: 0.8572060123784262
f1 score: 0.8875617588932807

Table c: Classification of KNN

Decision tree

Decision trees max_depth=6

Splitting Percentage (Training % and Testing %): 80-20
Testing accuracy: 93.64161849710982%
Classification error rate: 6.358381502890175%

Confusion Matrix:

[[224   9   1   0]
 [  1  78   7   1]
 [  0   0  11   1]
 [  0   2   0  11]]

precision: 0.8242653161281193
recall: 0.9041592985558503
f1 score: 0.8545574400650303

Table D: Classification of Decision Tree

Discussions

Comparing the two classifiers shows that K-nearest neighbors and the decision tree have exactly the same accuracy, 93.64%, for the 80:20 split.

The F1 score combines precision and recall, so we use it to measure model performance.

Here the KNN classifier has a higher F1 score than the decision tree.

To distinguish the two classifiers and their performance, their results are compared in Table x and Table y under classification; it is observed that the KNN classifier suits our dataset best.

The decision tree has a lower F1 score even though KNN and the decision tree have the same accuracy.

Conclusion

The comparison of the classifiers used in this assignment shows that K-nearest neighbors and the decision tree have the same performance in terms of accuracy.

However, in terms of F1 score, K-nearest neighbors performs better than the decision tree.

Result table

K nearest neighbor (KNN)

Splitting Percentage (Training % and Testing %)

                            50-50                60-40                80-20
n_neighbors
Power variable (p)
Testing accuracy            91.20370370370371%   92.61939218523878%   93.64161849710982%
Classification error rate   8.79629629629629%    7.38060781476122%    6.358381502890175%
precision                   0.8465658156996925   0.8996017827059918   0.9270637898686679
recall                      0.809500828387994    0.8040215824237817   0.8572060123784262
f1 score                    0.8247259934493818   0.8457758780971122   0.8875617588932807

Confusion Matrix (50-50):

[[577   6   3   0]
 [ 46 167   3   3]
 [  4   3  21   1]
 [  1   2   4  23]]

Confusion Matrix (60-40):

[[459   4   2   0]
 [ 28 148   3   0]
 [  3   5  15   0]
 [  1   5   0  18]]

Confusion Matrix (80-20):

[[231   2   1   0]
 [ 14  73   0   0]
 [  1   1  10   0]
 [  0   2   1  10]]

Table x: Accuracy results for K nearest neighbor

Result table

Decision trees

Decision trees max_depth=6

Splitting Percentage (Training % and Testing %)

                            50-50                60-40                80-20
Testing accuracy            92.24537037037037%   93.19826338639653%   93.64161849710982%
Classification error rate   7.754629629629628%   6.80173661360347%    6.358381502890175%
precision                   0.7868611018471237   0.7800093405644288   0.8242653161281193
recall                      0.8702219370468116   0.8741300168982008   0.9041592985558503
f1 score                    0.8178696121130331   0.815354746873236    0.8545574400650303

Confusion Matrix (50-50):

[[568  16   2   0]
 [ 16 179  21   3]
 [  0   0  24   5]
 [  0   4   0  26]]

Confusion Matrix (60-40):

[[449  14   2   0]
 [  2 156  18   3]
 [  0   0  19   4]
 [  0   4   0  20]]

Confusion Matrix (80-20):

[[224   9   1   0]
 [  1  78   7   1]
 [  0   0  11   1]
 [  0   2   0  11]]

Table y: Accuracy for Decision Tree

With the 80-20 split, both models, KNN and the decision tree, have the same accuracy. Accuracy alone is not a fair criterion for an unbalanced classification problem, so we also check the F1 score: KNN with the 80-20 split has the highest F1 score, 0.888, compared to 0.855 for the decision tree.
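Why accuracy is a weak criterion on unbalanced classes can be seen from a trivial baseline that always predicts the majority class. The sketch below uses the row totals of the 80-20 confusion matrices given in this report (234, 87, 12 and 13) as the actual class counts; that the first row is the unacc class is an assumption, and the baseline itself is hypothetical and was not part of the original experiment.

```python
# Actual class counts in the 80-20 test set (row totals of the reported
# confusion matrices). The first entry is assumed to be the majority class.
actual_counts = [234, 87, 12, 13]
total = sum(actual_counts)
n_classes = len(actual_counts)

# A baseline that always predicts the majority class is right for every
# majority-class row and wrong for all others.
baseline_accuracy = actual_counts[0] / total

# Recall is 1 for the majority class and 0 for every other class.
baseline_macro_recall = 1 / n_classes

# Precision of the majority class equals its share of the test set;
# F1 is zero for every class that is never predicted.
majority_precision = actual_counts[0] / total
majority_f1 = 2 * majority_precision * 1.0 / (majority_precision + 1.0)
baseline_macro_f1 = majority_f1 / n_classes

print(f"accuracy {baseline_accuracy:.3f}, macro recall {baseline_macro_recall:.3f}, "
      f"macro F1 {baseline_macro_f1:.3f}")
```

Such a baseline is correct about two thirds of the time without learning anything, while its macro F1 stays near 0.2, so the F1 score separates real models from trivial ones far better than accuracy does.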

Conclusion

With the 80-20 split, the KNN model reaches the same accuracy as the decision tree but a higher F1 score, and it is the most suitable model for our dataset with the parameters n_neighbors = 7 and power variable p = 2.

We were able to get a testing accuracy of 93.64161849710982%.

Of the splits tried, the 80:20 split performs best.

The attributes play a vital role for customers in assessing whether a car is in the accepted or unaccepted class.

Safety and person capacity are the main factors in rejecting a car as unacceptable.

The number of doors appears to play little role in deciding the class of the car.
