Analysis of Undergraduate Student Performance Using Feature Selection Techniques on Classification Algorithms
Abstract— Educational data mining is employed in many fields and draws on attributes of students' records, such as attendance, class tests, lab tests, spot tests, assignments, and results. This research focuses on predicting the performance of undergraduate students in computer science and engineering with a predictive data mining model that combines feature selection methods and classification algorithms. Feature selection is applied during data preprocessing to identify the most inherent and important attributes, so that students' performance can be analyzed and evaluated more effectively. We collected records of 800 final-year undergraduate students from North Western University. We used four feature selection methods: genetic algorithms, gain ratio, Relief, and information gain, together with five classification algorithms: K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and the J48 decision tree. The experimental results show that the gain ratio feature selection method with 10 selected features gives the best result, 87.375% accuracy, with the k-NN classifier. The different feature selection techniques are compared for student performance prediction based on students' academic performance.
Keywords— data mining; feature selection; genetic algorithm; gain ratio; Relief; information gain; classification; student performance.
I. Introduction
Educational data mining (EDM) describes a research field concerned with the application of data mining to educational data. Classification is a very beneficial form of data exploration that predicts students' academic performance. In data mining, knowledge discovery refers to the broad process of finding knowledge in data; it involves steps such as data selection, data cleaning, data transformation, data integration, pattern discovery, and pattern evaluation. The application of data mining is widely prevalent in education. Educational data mining is an emerging field that can be effectively applied in education and uses several ideas and concepts such as association rule mining, classification, and clustering [7]. The knowledge that emerges can be used to better understand students' promotion rate, retention rate, transition rate, and success [1].
Data mining is therefore pivotal to measuring the improvement of students' performance.
One of the most advanced algorithms for feature selection is the genetic algorithm, a stochastic method for function optimization based on the mechanics of natural genetics and biological evolution [2]. The algorithm first creates and initializes the individuals in the population; because the genetic algorithm is a stochastic optimization method, the genes of the individuals are usually initialized at random.
In this paper, we use genetic algorithms as a feature selection method to optimize the performance of a predictive model by selecting the most relevant features.
A classification model built on an association rules algorithm is used to construct a classifier that helps estimate students' performance.
To improve the results, several techniques can raise the correctness of the output of the data processing stages. These techniques are able to find information in the form of patterns and features, known as knowledge.
This research mainly emphasizes attribute (feature) selection. We used four feature selection methods: genetic algorithm (GA), gain ratio (GR), Relief, and information gain (IG). We then compared student performance prediction across five classification algorithms, K-Nearest Neighbor (KNN), Naïve Bayes (NB), Bagging, Random Forest (RF), and the J48 decision tree, for each feature selection technique.
This paper is structured as follows. Section II discusses previous research on student performance prediction in education and its influencing factors, as well as prior work on feature selection. Section III describes the steps of the study, from data preparation and preprocessing to selecting the best features from the dataset (based on attribute, dimension, and subset). Section IV provides the experimental results and their analysis. Finally, Section V offers conclusions and discusses future work.
II. Related work
Nowadays educational data mining has emerged as a very active research area, because many aspects of this field remain unexplored. Work connected to student performance, student behavior analysis, faculty performance, and the impact of these factors on students' final performance needs much attention.
J. K. Jothi and K. Venkatalakshmi conducted a performance analysis on graduate students' data collected from the Villupuram College of Engineering and Technology. The data covered a five-year period, and clustering methods were applied to address the problem of low graduate scores and to raise students' academic performance [3].
Feature selection is a fundamental stage related to classification accuracy: as the dimensionality of the study domain expands, the number of features becomes higher [2], [4].
A comparison between GA and full model selection (support vector machines and particle swarm model selection) on classification problems showed that GA gave better performance on problems with high dimensionality and large training sets [5].
Mythili M. S. and Shanavas A. R. applied classification algorithms to analyze and evaluate school students' performance using WEKA. They used various classification algorithms, namely J48, Random Forest, multilayer perceptron, IB1, and decision table, on data collected from a student management system [6].
Noah, Barida, and Egerton conducted a study to evaluate students' performance by grouping grades into classes using CGPA. They used methods such as neural networks, regression, and K-means to identify weak performers for the purpose of performance improvement [8].
Baradwaj and Pal described data mining techniques that help in the early identification of student dropouts and of students who need special attention. They used a decision tree built on information such as attendance, class test, semester, and assignment marks [9].
Ramesh, Parkavi, and Yasodha conducted a study on placement chance prediction by comparing the accuracy of techniques such as Naive Bayes Simple, multilayer perceptron, SMO, J48, and REPTree. From the results they concluded that the multilayer perceptron is more suitable than the other algorithms [10].
III. Proposed Method
The main ideas of the proposed approach are to increase classification accuracy and to obtain the essential features, i.e., to discover an optimal set of attributes. This task is carried out using state-of-the-art feature selection algorithms, namely genetic algorithms (GA), gain ratio (GR), Relief, and information gain attribute evaluation (IG).
Fig. 1: Proposed method (pipeline: data collection → data preprocessing (cleaning and transformation) → feature selection (Relief, GR, IG, GA) → classifiers (KNN, NB, Bagging, RF, J48) → evaluation → final model)
Finally, a subset of attributes is selected for the classification stage; attribute removal plays an important role in many classification methods. Five classification methods, considered very strong at solving non-linear problems, are then chosen to estimate the class probability: K-Nearest Neighbor (KNN), Naïve Bayes (NB), Bagging, Random Forest (RF), and the J48 decision tree.
A. Data Selection
The study uses a students' academic performance dataset consisting of 800 student records with 15 features, collected from the Department of Computer Science and Engineering, North Western University, Khulna, Bangladesh. The dataset attributes are shown in Table I.
The 15 attributes are id, attendance, assignment, class test, lab test, spot test, skill, central viva, extra-curriculum activities, quiz test, project/presentation, backlog, final semester result, final CGPA, and class. Grades are assigned to all students using the following mapping: A (91%–100%), B (71%–90%), C (61%–70%), D (41%–60%), F (0%–40%).
The final semester result and final CGPA are assigned to all students using the following mapping: A (75%–100%), B (70%–74%), C (65%–69%), D (60%–64%), F (0%–60%).
TABLE I. LIST OF DATASET ATTRIBUTES

No. | Attribute Name | Possible Values
1 | Student Id | Id of the student
2 | Attendance | A, B, C, D, F
3 | Assignment | A, B, C, D, F
4 | Class test | A, B, C, D, F
5 | Lab test | A, B, C, D, F
6 | Spot test | A, B, C, D, F
7 | Skill | A, B, C, D, F
8 | Central viva | A, B, C, D, F
9 | Extra curriculum activities (ECA) | YES/NO
10 | Quiz test | A, B, C, D, F
11 | Project/Presentation | YES/NO
12 | Backlog | YES/NO
13 | Final semester result | A, B, C, D, F
14 | Final CGPA | A, B, C, D, F
15 | Class | Excellent, Very Good, Good, Average, Poor
Extra-curriculum activities are encoded in two classes: yes (1) and no (0). Project/presentation is likewise encoded in two classes, yes (1) and no (0), and backlog in two classes, yes (0) and no (1), for all students.
B. Data Preprocessing
1) Data cleaning and transformation
Data cleaning is the process of identifying and removing corrupt records from a record set; noisy records were removed from our student dataset.
Data transformation prepares the selected data in a format ready for processing. In our experiment, we transformed the student grade data by discretizing it into categorical classes: Excellent, Very Good, Good, Average, and Poor. The final class based on this index is shown in Table II, and a discretization sketch follows the table.
TABLE II. CATEGORICAL CLASSES

Class | Range
Excellent | 11.8 – 13
Very Good | 9.2 – 11.7
Good | 7.9 – 9.1
Average | 5.3 – 7.8
Poor | 0 – 5.2
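As a concrete illustration of this step, the following is a minimal pandas sketch of the discretization in Table II. The column name "index_score" is hypothetical, and the continuous bin edges approximate the listed ranges; the study's own preprocessing was done in WEKA/Matlab.

```python
# A minimal pandas sketch of the discretization in Table II; the column name
# "index_score" is hypothetical, and the bin edges follow the listed ranges.
import pandas as pd

bins = [0, 5.2, 7.8, 9.1, 11.7, 13]
labels = ["Poor", "Average", "Good", "Very Good", "Excellent"]
df = pd.DataFrame({"index_score": [4.0, 6.5, 8.0, 10.0, 12.5]})
df["Class"] = pd.cut(df["index_score"], bins=bins,
                     labels=labels, include_lowest=True)
print(df)  # each score mapped to one of the five categorical classes
```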
2) Feature Selection
Feature selection is also called attribute selection. We used four feature selection methods: genetic algorithms, gain ratio, Relief, and information gain. These methods find an optimal set of features so that classification achieves better accuracy.
Genetic Algorithms (GA). Genetic algorithms are search algorithms that simulate the processes of reproduction and natural selection. Each attribute in the dataset is treated as a gene in a separate linear sequence called a chromosome [11].
A GA begins by initializing a population of solutions, after which three operators, selection, crossover, and mutation, are applied to the population. A fitness function is evaluated repeatedly until an optimal solution is reached.
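To make this loop concrete, below is a minimal, hedged sketch of GA-based feature selection in Python. It is not the authors' WEKA/Matlab implementation; the population size, mutation rate, KNN fitness classifier, and synthetic data are all assumptions for illustration.

```python
# A minimal sketch of GA-based feature selection (the paper's runs used
# WEKA/Matlab; population size, mutation rate, the KNN fitness classifier,
# and the synthetic data below are all assumptions for illustration).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

def fitness(mask, X, y):
    # Fitness = cross-validated KNN accuracy on the selected feature subset
    # (5-fold here just to keep the sketch fast; the paper reports 10-fold).
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=5).mean()

def ga_select(X, y, pop_size=20, generations=15, p_mut=0.05):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))  # genes initialized at random
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        # Selection: binary tournament (keep the fitter of two random picks).
        parents = np.array([pop[max(rng.integers(0, pop_size, 2),
                                    key=lambda i: scores[i])]
                            for _ in range(pop_size)])
        # Crossover: single cut point between consecutive parent pairs.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # Mutation: flip each bit with small probability p_mut.
        flip = rng.random(children.shape) < p_mut
        children[flip] = 1 - children[flip]
        pop = children
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()]  # best 0/1 feature mask found

# Hypothetical usage on synthetic data standing in for the student records:
X, y = make_classification(n_samples=200, n_features=15, random_state=0)
print(np.flatnonzero(ga_select(X, y)))  # indices of the selected attributes
```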
Gain Ratio (GR). The GR is computed as the information gain divided by the entropy of the attribute's values:
GainRatio(Class, Attribute) = InfoGain(Class, Attribute) / H(Attribute) [12]
where H(Attribute) is the entropy of the attribute. GR measures the relative worth of an attribute with respect to the class.
Relief. In the Relief algorithm, a good discriminating attribute is one that takes similar values within the same class and dissimilar values across different classes. Relief uses a nearest-neighbor method to calculate a relevancy score for each attribute: it repeatedly samples an instance and updates each attribute's score based on the attribute's values in the nearest instance of the same class (the nearest hit) and of a different class (the nearest miss).
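A simplified Relief-style scorer is sketched below for illustration only; WEKA's ReliefF implementation is more elaborate (k nearest hits/misses, distance weighting), and the numeric, [0, 1]-scaled synthetic input is an assumption.

```python
# A simplified Relief-style scorer for illustration (WEKA's ReliefF is more
# elaborate: k nearest hits/misses, distance weighting). The numeric,
# [0, 1]-scaled synthetic input below is an assumption.
import numpy as np
from sklearn.datasets import make_classification

def relief_scores(X, y, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for idx in rng.integers(0, n, n_samples):
        xi, yi = X[idx], y[idx]
        dist = np.abs(X - xi).sum(axis=1)  # Manhattan distance to every row
        dist[idx] = np.inf                 # an instance cannot be its own hit
        same, diff = (y == yi), (y != yi)
        hit = np.where(same)[0][np.argmin(dist[same])]   # nearest same class
        miss = np.where(diff)[0][np.argmin(dist[diff])]  # nearest other class
        # Reward attributes that differ on the miss and agree on the hit.
        w += np.abs(X[miss] - xi) - np.abs(X[hit] - xi)
    return w / n_samples  # higher score = better discriminating attribute

X, y = make_classification(n_samples=150, n_features=10, random_state=0)
X = (X - X.min(0)) / (X.max(0) - X.min(0))  # scale features to [0, 1]
print(relief_scores(X, y).round(3))
```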
Information Gain (IG). Information gain is a measure based on entropy:
InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute) [13]
where H(Class) is the total entropy of the class and H(Class | Attribute) is the conditional entropy of the class given the attribute.
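The two formulas above can be computed directly; the sketch below does so with pandas on a tiny hypothetical table (the column names "ClassTest" and "Class" and their values are illustrative, not the dataset's).

```python
# A direct computation of the two formulas above with pandas; the tiny table
# and its column names ("ClassTest", "Class") are hypothetical.
import numpy as np
import pandas as pd

def entropy(series):
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def info_gain(df, attr, target="Class"):
    # InfoGain = H(Class) - H(Class | Attribute), with the conditional
    # entropy taken as the weighted entropy within each attribute value.
    h_cond = sum(len(g) / len(df) * entropy(g[target])
                 for _, g in df.groupby(attr))
    return entropy(df[target]) - h_cond

def gain_ratio(df, attr, target="Class"):
    h_attr = entropy(df[attr])  # GainRatio = InfoGain / H(Attribute)
    return info_gain(df, attr, target) / h_attr if h_attr else 0.0

df = pd.DataFrame({"ClassTest": list("AABBC"), "Class": list("GGAAP")})
print(info_gain(df, "ClassTest"), gain_ratio(df, "ClassTest"))
```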
C. Classification
We used five classification algorithms: K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and the J48 decision tree. These algorithms mine the data produced by the feature selection step.
K-Nearest Neighbor (KNN). The K-Nearest Neighbor algorithm, called KNN, is a widely used classification algorithm.
KNN works on the premise that the data lie in a feature space, so the distances among data points are computed; Euclidean or Hamming distance is used according to the data type of the classes. A single value of K determines the number of nearest neighbors that vote on the class label of an unknown sample; if K = 1, the method is called nearest-neighbor classification [14].
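As a concrete illustration, the following sketch uses scikit-learn's KNeighborsClassifier as a stand-in for WEKA's IBk; the toy points and k = 3 are assumptions.

```python
# A concrete KNN illustration with scikit-learn's KNeighborsClassifier as a
# stand-in for WEKA's IBk; the toy points and k = 3 are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])  # two grade-like features
y = np.array(["Poor", "Poor", "Good", "Good"])
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)                 # KNN "training" simply stores the points
print(knn.predict([[5, 6]]))  # majority vote of the 3 nearest -> "Good"
```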
Naïve Bayes (NB). Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems.
Bayes' theorem provides an equation for calculating the posterior probability P(c | x) from P(c), P(x), and P(x | c):
P(c | x) = P(x | c) * P(c) / P(x) [15]
A small worked example with toy numbers follows the list below.
• P(c | x): the posterior probability of class (c, target) given predictor (x, attributes).
• P(c): the prior probability of class.
• P(x | c): the likelihood, which is the probability of predictor given class.
• P(x): the prior probability of predictor.
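Here is that worked example of the Bayes-rule arithmetic; every probability is a made-up toy number, purely for illustration.

```python
# Worked Bayes-rule arithmetic matching the formula above; every probability
# here is a made-up toy number, purely for illustration.
p_c = 0.3           # prior P(c): P(class = "Excellent")
p_x_given_c = 0.8   # likelihood P(x | c): P(attendance = "A" | "Excellent")
p_x = 0.4           # evidence P(x): P(attendance = "A")
p_c_given_x = p_x_given_c * p_c / p_x  # posterior P(c | x)
print(p_c_given_x)  # 0.6 -> probability of "Excellent" given attendance "A"
```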
Bagging. Bagging is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of learning algorithms used in statistical classification. It reduces variance and helps to avoid overfitting, and it is usually applied to decision tree methods [16].
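A hedged bagging sketch with scikit-learn, analogous in spirit to WEKA's Bagging meta-classifier; the synthetic data and number of estimators are assumptions.

```python
# A hedged bagging sketch with scikit-learn (analogous in spirit to WEKA's
# Bagging meta-classifier); synthetic data and parameters are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
bag = BaggingClassifier(n_estimators=50)  # 50 trees, one per bootstrap sample
print(cross_val_score(bag, X, y, cv=10).mean())  # averaging reduces variance
```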
Random Forest (RF). Random forests are an ensemble learning method for classification that corrects for the decision tree's habit of overfitting its training set.
A random forest combines many decision trees to classify data samples into classes and is a commonly used statistical technique for classification. The worth of each individual tree is not essential; the purpose of the forest is to reduce the error rate of the whole ensemble. The error rate depends on two factors: the correlation between any two trees and the strength of the individual trees [17].
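For illustration, a scikit-learn random forest standing in for WEKA's implementation; the synthetic data and hyperparameters are assumptions.

```python
# A scikit-learn random forest standing in for WEKA's implementation;
# synthetic data and hyperparameters are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Random feature subsets at each split de-correlate the individual trees,
# which is what drives the whole forest's error rate down.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
print(cross_val_score(rf, X, y, cv=10).mean())
```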
J48 Decision Tree. J48 is an algorithm used to generate a decision tree, developed by Ross Quinlan as mentioned earlier; it is the WEKA implementation of C4.5. The algorithm uses a greedy technique to induce decision trees for classification and applies reduced-error pruning. J48 can handle both discrete and continuous attributes, attributes with differing costs, and training data with missing attribute values [18].
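J48 itself is Java/WEKA code; as a rough Python stand-in, the sketch below uses a CART tree with entropy-based splitting and cost-complexity pruning, which is not identical to C4.5's gain-ratio splits and reduced-error pruning. The data is synthetic.

```python
# J48 is WEKA's Java implementation of C4.5; as a rough Python stand-in this
# uses a CART tree with entropy splitting and cost-complexity pruning, which
# is NOT identical to C4.5's gain-ratio splits and reduced-error pruning.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01)  # pruned
print(cross_val_score(tree, X, y, cv=10).mean())
```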
D. Evaluate the results
This experiment compares the accuracy obtained with the attributes selected by each feature selection technique under each classification technique. Accuracy is calculated using ten-fold cross-validation, a technique for validating accuracy; a sketch of the protocol follows.
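Here is a sketch of the ten-fold cross-validation protocol with scikit-learn; the study itself used WEKA, and the synthetic data here is an assumption.

```python
# A sketch of the ten-fold cross-validation protocol behind the results
# (scikit-learn here; the study itself used WEKA, and the data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=800, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 10 folds
acc = cross_val_score(KNeighborsClassifier(5), X, y, cv=cv, scoring="accuracy")
print(f"{acc.mean() * 100:.3f}% mean accuracy over the 10 folds")
```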
WEKA (Waikato Environment for Knowledge Analysis) is used as the data mining tool. WEKA is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand; it is free software licensed under the GNU General Public License. Matlab and WEKA were used for feature selection, preprocessing, and classification.
IV. Experiment and Results
The suggested approach for predicting student performance is carried out in two major phases. In the first phase, the feature space is searched to reduce the number of features and prepare the conditions for the next step. This task is carried out using four dimension reduction techniques, namely the GA, GR, Relief, and IG algorithms; at the end of this step a subset of features is chosen for the next round.
The optimal features of these techniques are summarized in Table III. Afterwards, the selected features are used as inputs to the classifiers. As mentioned previously, five classifiers are used to estimate the success probability: KNN, NB, Bagging, Random Forest, and the J48 decision tree.
Fig 2: Student Data Set
Fig 3: Visualize Class Attributes
In our experiment, we applied the full dataset to each feature selection method and obtained the selected feature sets shown in Table III below.
To evaluate the goodness of each feature selection method, we then carried out a further experiment: the selected features from the prior stage were classified with the five classifiers (K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and the J48 decision tree), giving the performance measures shown in Table IV.
TABLE III. LIST OF SELECTED FEATURES

FS Method | No. of Selected Features | Selected Features
GA | 10 | 1, 2, 3, 4, 7, 9, 10, 11, 12, 13
GR | 10 | 1, 3, 4, 6, 7, 8, 10, 11, 12, 13
RELIEF | 9 | 1, 2, 3, 4, 8, 10, 11, 12, 13
IG | 10 | 1, 2, 3, 4, 5, 6, 7, 8, 12, 13
TABLE IV. PERFORMANCE MEASURES OF SELECTED FEATURES

Classifier | Performance index | GA | GR | RELIEF | IG
KNN | Accuracy (%) | 84.875 | 87.375 | 85.875 | 77.125
KNN | Precision | 0.785 | 0.874 | 0.859 | 0.772
KNN | Recall | 0.785 | 0.874 | 0.859 | 0.771
KNN | F-Measure | 0.784 | 0.874 | 0.858 | 0.771
KNN | ROC Area | 0.882 | 0.936 | 0.931 | 0.875
Naïve Bayes | Accuracy (%) | 75.5 | 80.375 | 80.125 | 76
Naïve Bayes | Precision | 0.759 | 0.812 | 0.807 | 0.768
Naïve Bayes | Recall | 0.755 | 0.804 | 0.801 | 0.760
Naïve Bayes | F-Measure | 0.756 | 0.805 | 0.803 | 0.760
Naïve Bayes | ROC Area | 0.929 | 0.948 | 0.948 | 0.932
Bagging | Accuracy (%) | 76.375 | 81.5 | 80 | 75.25
Bagging | Precision | 0.769 | 0.816 | 0.801 | 0.752
Bagging | Recall | 0.764 | 0.815 | 0.800 | 0.753
Bagging | F-Measure | 0.764 | 0.815 | 0.800 | 0.752
Bagging | ROC Area | 0.930 | 0.954 | 0.954 | 0.931
RF | Accuracy (%) | 81.625 | 86.75 | 86.625 | 81.125
RF | Precision | 0.816 | 0.808 | 0.866 | 0.811
RF | Recall | 0.816 | 0.809 | 0.866 | 0.811
RF | F-Measure | 0.816 | 0.808 | 0.866 | 0.811
RF | ROC Area | 0.951 | 0.878 | 0.973 | 0.957
J48 | Accuracy (%) | 77.75 | 79.25 | 81.875 | 77
J48 | Precision | 0.777 | 0.792 | 0.819 | 0.771
J48 | Recall | 0.778 | 0.793 | 0.819 | 0.770
J48 | F-Measure | 0.777 | 0.792 | 0.819 | 0.770
J48 | ROC Area | 0.903 | 0.902 | 0.922 | 0.879
Table IV shows the accuracy results of the students' performance analysis on the student dataset. It reveals that Random Forest is consistently among the best classifiers for analyzing students' performance, with good accuracy across feature selection methods.
We computed the accuracy of the features selected by all four methods under the five classifiers (KNN, Naïve Bayes, Bagging, RF, and the J48 decision tree). The results are shown in Table V and Figure 4 below:
TABLE V. COMPARISON OF FS METHODS (ACCURACY, %)

FS Method | KNN | NB | Bagging | RF | J48
GA | 84.875 | 75.5 | 76.375 | 81.625 | 77.75
GR | 87.375 | 80.375 | 81.5 | 87.375 | 79.25
RELIEF | 85.875 | 80.125 | 80 | 86.625 | 81.875
IG | 77.125 | 76 | 75.25 | 81.125 | 77
Fig. 4: Comparison of accuracy of feature selection
We then computed the highest accuracy obtained by each feature selection method across the classifiers. The results are shown in Table VII and Figure 5 below:
The attributes selected by each feature selection method were further tested with the classification algorithms, and this experiment compares the accuracy results using all selected features. As Table VII shows, the genetic algorithm (GA) with K-Nearest Neighbor (KNN) achieves 84.875% accuracy; gain ratio (GR) with KNN achieves 87.375%; Relief with Random Forest (RF) achieves 86.625%; and information gain (IG) with RF achieves 81.125%, the lowest value on this dataset.
Gain ratio with K-Nearest Neighbor therefore gives the best accuracy, 87.375%.
TABLE VII. COMPARISON OF FS METHODS WITH CLASSIFIERS

Method | Accuracy (%)
GA + KNN | 84.875
GR + KNN | 87.375
Relief + RF | 86.625
IG + RF | 81.125
Fig 5: Comparison of the accuracy of feature selection methods with classifiers
V. Conclusion
This research aims to develop a model to classify student performance. In our experiment, the K-Nearest Neighbor, Naïve Bayes, Bagging, Random Forest, and J48 decision tree classification algorithms were applied with the genetic algorithm, gain ratio, Relief, and information gain feature selection methods. The experimental results show that the performance of the student prediction model depends greatly on choosing the most relevant attributes from the list of attributes in the student dataset. The gain ratio method with the K-Nearest Neighbor classifier showed the best accuracy among the methods considered.
For future work, we will apply more feature selection algorithms and also work on optimization algorithms with large datasets.
VI. Acknowledgment
Thanks to our supervisor, who inspired us to research this interesting area and provided many helpful tips. The authors thank North Western University, Khulna, for providing the student data.
VII. References
[1] S. Roy and A. Garg, "Analyzing performance of students by using data mining techniques," in 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON), Mathura, 2017.
[3] J. K. Jothi and K. Venkatalakshmi, "Intellectual performance analysis of students by using data mining techniques," International Journal of Innovative Research in Science, Engineering and Technology, vol. 3, special iss. 3, March 2014.
[4] K. Maharani et al., "Comparison analysis of data mining methodology and student performance improvement influence factors in small data set," in International Conference on Science in Information Technology (ICSITech), 2015.
[5] J. M. Valencia-Ramirez, J. Raya, J. R. Cedeno, R. R. Suarez, H. J. Escalante, and M. Graff, "Comparison between genetic programming and full model selection on classification problems," in IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pp. 1-6, 2014.
[6] M. S. Mythili and A. R. Mohamed Shanavas, "An analysis of students' performance using classification algorithms," IOSR Journal of Computer Engineering (IOSR-JCE), vol. 16, iss. 1, Jan. 2014.
[8] Otobo Firstman Noah, Baah Barida, and Taylor Onate Egerton, "Evaluation of student performance using data mining over a given data space," International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, vol. 2, iss. 4, September 2013.
[9] B. K. Baradwaj and S. Pal, "Mining educational data to analyze students' performance," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 2, no. 6, 2011.
[10] V. Ramesh, P. Parkavi, and P. Yasodha, "Performance analysis of data mining techniques for placement chance prediction," International Journal of Scientific and Engineering Research, vol. 2, iss. 8, August 2011.