Clustering In Data Mining: Types And Techniques

Statement of the problem

A cluster is a group of objects that belong to the same class, and clustering is the task of assembling similar objects into the same cluster. Clustering is also known as cluster analysis. Other related terms include typological analysis, automatic classification, numerical taxonomy and botryology (Bramer, 2017). Data mining is the process of sorting through large data sets to recognize patterns and establish relationships through data analysis; in other words, it is the extraction of relationships from a large amount of data. Others define data mining as a method that organizations use to turn raw data into valuable information. Organizations use software to look for patterns in their databases, often in order to learn more about their customers (Azzalini & Scarpa, 2012).

An analysis that aims to learn more about the relationships in a database requires an understanding of two main concepts: clustering and data mining. This research paper focuses on clustering in data mining and, in particular, on the various types of clustering used in data mining (Gan, 2016).

Clustering is one of the most popular concepts in data mining. To extract knowledge from a database, one first needs to find the similarities between data items, which is one of the main tasks of cluster analysis. Cluster analysis first finds the similarities in a database and then groups similar data objects into clusters. For this reason, cluster analysis is often carried out as an early step in data mining.

This research addresses three research questions:

What is clustering and what are the types of clustering?
What is data mining?
Why clustering in data mining?

These three research questions will help in examining clustering in data mining. In addition, they will help identify some of the reasons why clustering is one of the most common and popular techniques in data mining.

To start with, data mining is widely identified with knowledge discovery. It is the application of both traditional and automated data analysis techniques to discover previously hidden relationships between data items. It also involves analyzing data stored in a data warehouse. One can also define data mining as the non-trivial extraction of implicit and potentially useful information from a database (Giusti, Ritter, & Vichi, 2014). Figure 1 shows data mining as an iterative knowledge discovery process.

Research questions

Figure 1: Data mining (Hemlata Shau, n.d)

As Figure 1 shows, data mining comprises six major interactive processes: data selection, data cleaning, data integration, data transformation, pattern evaluation, and knowledge representation. Data cleaning, also referred to as data scrubbing or data cleansing, is the phase in which irrelevant or noisy data is removed from the collection. Data integration is where multiple data sources are combined into a common source (Abbass, Sarker, & Newton, 2010). Pattern evaluation is the process by which interesting patterns that represent knowledge are identified, based on a given measure. The final phase is knowledge representation, where the useful information is presented to the user. This is where visualization techniques are used to help users interpret and understand the data mining results (Klösgen, 2002).

Data mining includes four main classes of tasks: classification, clustering, association rule learning, and regression. Clustering, which is discussed later in this paper, is a main class of data mining task. Other classes use the results gathered in clustering to obtain finer details of useful information. Classification, for example, is the task of generalizing a known structure so that it can be applied to new data. For instance, an email program such as Yahoo Mail can classify emails as inbox or sent, spam or legitimate. Regression, on the other hand, attempts to find a function that models the data with the least error (Olson, 2015).

The data mining process is composed of data preparation, data mining, information expression, and analysis and decision-making. Figure 2 below shows a general process of data mining.

Figure 2: Data mining process (Tan, H., 2012)

Literature Review

As Figure 2 shows, data preparation consists of two major procedures: data collection and data collation. Data collection is the initial step of the data mining process. One of the main duties of data collation is to eliminate noise in the data; this step is also used to eliminate inconsistent data. The data mining step is the core stage of the overall process, and it is where the four tasks of data mining are carried out. According to the figure, this stage has three major steps: choosing the mining method, choosing the data mining algorithm, and the data mining itself (Maloof, 2006). Information expression is the second-to-last step, where knowledge expression technology is used to present the mined knowledge to the users. Analysis and decision-making, the last step, is used to analyze the results of the whole process.

Clustering

Clustering, one of the four classes of data mining tasks, is the procedure of discovering structures and groups in a database. It places data elements into related groups. There are various clustering techniques, some of which are discussed later in this paper, including expectation-maximization (EM) clustering and k-means. One of the main objectives of clustering is to group similar objects together so that each group differs from the other groups. Grouping in clustering is done according to customer preference or logical relationship (Aggarwal, 2016). An example of clustering is shown in the diagram below.

Figure 3: Examples of clustering (Archana, 2015)

The major types of clustering include exclusive, overlapping, and hierarchical clustering. In exclusive clustering, objects are grouped in an exclusive way, so that a given datum belongs to exactly one definite cluster. Overlapping clustering uses fuzzy sets to classify data (Han, Kamber, & Pei, 2013).

One of the major objectives of clustering is to determine the intrinsic grouping in a set of unlabeled data. The major requirements of clustering in data mining are scalability, interpretability, high dimensionality, insensitivity, dealing with various types of attributes, the ability to discover clusters with an arbitrary shape, and minimal requirements for domain knowledge to determine input parameters (Kantardzic, 2014).

There are various types of clustering in data mining: partition, hierarchical, exclusive, overlapping, and complete clustering. Hierarchical clustering is also known as nested clustering or hierarchical cluster analysis. It is an algorithm that groups similar objects into groups referred to as clusters (Perner, 2013). Hierarchical clustering starts by treating every observation as a separate cluster. It then repeatedly executes two steps. The first step is identifying the two clusters that are closest together. The second step is merging these two most similar clusters. The two steps continue until all the clusters are merged together. Figure 4 below illustrates hierarchical clustering.

Figure 4: Hierarchical type of clustering (Bock, n.d)

The main output of hierarchical clustering is a dendrogram, which shows the hierarchical relationship between the clusters.
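The two repeated steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names, the sample points and the choice of single linkage (distance between the closest pair of points) are assumptions, since the text does not say how the distance between two clusters is measured.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points (assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # start: every observation is its own cluster
    merges = []
    while len(clusters) > 1:
        # Step 1: identify the two clusters that are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_linkage(clusters[i], clusters[j])
        # Step 2: merge that pair into a single cluster.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges


# Hypothetical sample data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
for left, right, d in agglomerative(points):
    print(left, "+", right, "at distance", round(d, 2))
```

The list of merges, read from first to last, carries the same information a dendrogram draws: which clusters were joined and at what distance.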

Partition clustering is the division of a set of data objects into non-overlapping clusters, such that each object is in exactly one subset. In partitioning clustering, objects are classified based on their similarities. Some of the common methods used in this type of clustering are k-means clustering, the CLARA algorithm and k-medoids clustering (Tan, Steinbach, & Kumar, 2014).

The k-means method is a partitioning method widely used in unsupervised machine learning to partition a dataset into k groups, that is, k clusters; here k is the number of groups, which must be specified in advance by the analyst. The method assigns objects to groups so that objects within the same cluster are as similar as possible. The first step is to specify the number of clusters, k, to be generated in the final solution. The second step is to select k objects at random from the dataset to serve as the initial cluster centers; these are referred to as centroids or cluster means. The third step is to assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid. The fourth step is to update the centroid of each of the k clusters by calculating the new mean of all the data points in the cluster. The fifth step is to repeat the third and fourth steps, iteratively minimizing the total within-cluster sum of squares, until the assignments stop changing (Wu & Kumar, 2009).
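The five steps can be illustrated with a short, self-contained Python sketch. It is a simplified illustration under stated assumptions rather than a production implementation: the function names, the fixed random seed, the iteration cap and the sample data are all hypothetical, and an empty cluster simply keeps its previous centroid.

```python
import random
from math import dist


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters, within-cluster sum of squares)."""
    rng = random.Random(seed)
    # Steps 1-2: k is given; pick k objects at random as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign every observation to its closest centroid (Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: update each centroid to the mean of the points assigned to it.
        new_centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Step 5: repeat until the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wcss = sum(dist(p, centroids[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return centroids, clusters, wcss


# Hypothetical sample data: two well-separated groups of three points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters, wcss = k_means(data, k=2)
print(centroids)
print(round(wcss, 3))
```

Because the initial centroids are chosen at random, different seeds can lead to different final clusters on less clearly separated data, which is why k-means is often run several times.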

Clustering Large Applications (CLARA) was introduced by Kaufman and Rousseeuw in 1990 as an extension of k-medoids. Rather than working on the whole data set, the CLARA algorithm considers a small sample of the data with a fixed size. It then applies the PAM (Partitioning Around Medoids) algorithm to generate an optimal set of medoids for the sample. CLARA repeats the sampling and clustering process a pre-specified number of times to minimize sampling bias. The method follows four main steps: first, randomly splitting the data set into subsets of fixed size; second, running the PAM algorithm on each subset; third, calculating the sum or the mean of the dissimilarities of the observations to their closest medoid; and lastly, retaining the sub-dataset for which the sum or mean is minimal (King, 2015).
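A rough Python sketch of the CLARA idea follows. Two simplifications are assumed and should be read as such: instead of PAM, the helper `best_medoids` simply tries every combination of k objects in the small sample, and each candidate set of medoids is scored against the full data set, which is the usual formulation of CLARA. All names, parameters and data here are hypothetical.

```python
import random
from itertools import combinations
from math import dist


def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def best_medoids(sample, k):
    """Stand-in for PAM (an assumption): exhaustive search over a small sample."""
    return min(combinations(sample, k), key=lambda ms: total_cost(sample, ms))


def clara(points, k, sample_size=5, repeats=5, seed=0):
    """Return the best medoids found over several random samples, and their cost."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(repeats):
        # Draw a fixed-size random sample and cluster only the sample.
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = best_medoids(sample, k)
        # Score the sample's medoids by the sum of dissimilarities.
        cost = total_cost(points, medoids)
        # Retain the medoids for which that sum is minimal.
        if cost < best_cost:
            best, best_cost = medoids, cost
    return list(best), best_cost


# Hypothetical sample data: two groups of four points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
medoids, cost = clara(data, k=2)
print(medoids, round(cost, 3))
```

Repeating the sampling reduces the chance that one unrepresentative sample determines the final medoids, at the price of running the inner clustering several times.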

k-medoids, on the other hand, is related to the medoid shift algorithm. This partitioning method breaks a dataset of n objects up into k clusters. A medoid is an object within a cluster whose average dissimilarity to all the other objects in the cluster is minimal, so it acts as the cluster's representative. The method requires the user to specify k, the number of clusters to be generated, in advance (Wu, J., 2014).

Overlapping clustering reflects the fact that a data object can belong to more than one group at the same time. It uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. Exclusive clustering, by contrast, assigns each value to a single cluster. Complete clustering first performs a hierarchical clustering that uses a set of dissimilarities between the n objects being clustered (Mirkin, 2005).
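The difference between degrees of membership and a single exclusive assignment can be shown with a small Python sketch. The membership formula used here is the one from fuzzy c-means, which is an assumption on our part because the text does not name a specific fuzzy method; the function name, the fuzzifier `m` and the cluster centers are hypothetical.

```python
from math import dist


def fuzzy_memberships(point, centers, m=2.0):
    """Degree of membership of one point in each cluster; the degrees sum to 1."""
    distances = [dist(point, c) for c in centers]
    zeros = distances.count(0.0)
    if zeros:  # the point sits exactly on one or more centers
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** power for d_k in distances)
        for d_i in distances
    ]


centers = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centers
u = fuzzy_memberships((1.0, 0.0), centers)
print(u)  # overlapping view: partial membership in both clusters

# Exclusive view: keep only the cluster with the highest membership.
print(max(range(len(u)), key=u.__getitem__))
```

A point halfway between the two centers receives equal membership in both clusters, whereas an exclusive method would have to pick one of them.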

The concept of clustering originated with Kroeber and Driver in the 1930s. Clustering is not one specific algorithm but a general task to be solved. One popular notion of a cluster is a group with small distances among its members (Larose & Larose, 2014).

In data mining, clustering is done to organize data into clusters so that the internal structure of the data can be identified. At times, partitioning itself is the goal, as it can reveal unforeseen relationships in the data. In addition, clustering prepares the data for other artificial intelligence techniques. Clustering can also lead to discoveries in the data, such as recurring patterns, topics and underlying rules. To cluster data, a similarity or dissimilarity measure has to be chosen so that the data points can be grouped accordingly. A similarity measure expresses the degree to which a pair of objects are alike. A dissimilarity measure, on the other hand, is a distance measure: it gives the distance between data points, or between a point and a cluster. Distance measures include the Euclidean distance, the cosine distance, the Tanimoto distance, and the squared Euclidean distance (Edureka, 2014).
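The four distance measures named above can be written as short Python functions. This is a sketch for numeric vectors of equal length; the function names are our own, the cosine and Tanimoto distances are taken as one minus the corresponding similarity (an assumed but common convention), and both are undefined for all-zero vectors.

```python
from math import sqrt


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return sqrt(squared_euclidean(a, b))


def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    return 1.0 - dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))


def tanimoto_distance(a, b):
    # 1 minus the Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b).
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)


a, b = (1.0, 0.0), (1.0, 1.0)  # hypothetical example vectors
for measure in (euclidean, squared_euclidean, cosine_distance, tanimoto_distance):
    print(measure.__name__, round(measure(a, b), 4))
```

Euclidean and squared Euclidean distance depend on the magnitude of the vectors, while cosine distance depends only on their direction, so the choice of measure changes which objects count as similar.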

As highlighted in the previous sections, the main goal of clustering is to group similar objects that are related to each other and different from the objects in other groups. Grouping in clustering is done according to logical relationships or according to consumer preferences.

Whatever the type of clustering, certain requirements apply. First, the output must be interpretable; that is, the results should be comprehensible and usable by anyone. Second, all types of clustering must be able to deal with erroneous or missing data (Maheshwari, 2015). Third, they must be able to deal with both high-dimensional and low-dimensional data. Fourth, as highlighted in the literature review, all types of clustering must be scalable; that is, they must be able to deal with very large databases and with different kinds of attributes. Fifth, they must be able to detect clusters of arbitrary shape and should not be bound to distance measures only (Zanasi, Brebbia, & Ebecken, 2007).

According to the literature, clustering has three stages, which are shown in the figure below.

Figure 5: Stages of clustering

Clustering is applied in data mining in two major ways. First, clustering can be used as a stand-alone tool to gain insight into the data distribution and to observe the features of each cluster. Second, clustering can be used as a pre-processing step for other algorithms, such as classification and feature selection (Berry & Browne, 2006).

As the previous sections show, clustering organizes data into clusters that reveal the internal structure of the data. Clustering methods are therefore useful for knowledge discovery in a database. Among the tasks of data mining, clustering is a key task, and it can be carried out by a number of algorithms. Some of the common types of algorithms used in clustering are partitioning and hierarchical algorithms (Chu & Lin, 2005).

Conclusion

This research paper has shown that clustering organizes data into clusters so that there is low inter-cluster similarity and high intra-cluster similarity; informally, it finds natural groupings among objects. The major goal of the data mining process, as this paper has shown, is to extract information from a very large database and then transform the extracted information into a usable form. One of the main points put across by this research paper is that clustering is an essential tool not only in data mining but also in data analysis. As highlighted in this paper, clustering can be done by a number of algorithms, namely partitioning, hierarchical, overlapping and exclusive algorithms.

There are several points to remember about clustering in data mining. First, the data objects in a cluster are treated as one group. Second, when performing a cluster analysis, database administrators first partition a set of data into groups. Third, the main advantage of clustering over other tasks in data mining is that it helps single out the useful features that distinguish the different groups. Lastly, clustering has not yet seen a critical breakthrough, but with the continued development of modern technology, future breakthroughs may lead to wider adoption of clustering in data mining.

References

Abbass, H. A., Sarker, R. A., & Newton, C. S. (2010). Data mining : a heuristic approach. Idea Group.

Aggarwal, C. C. (2016). Data mining : the textbook. Cham: New York : Springer.

Archana. (2015). 2015. Retrieved from Slideshare: https://www.slideshare.net/archnaswaminathan/cdm-44314029

Azzalini, A., & Scarpa, B. (2012). Data Analysis and Data Mining : an Introduction. Oxford: Oxford Press.

Berry, M. W., & Browne, M. (2006). Lecture notes in data mining. New York: Hackensack.

Bock, T. (n.d). What is Hierarchical Clustering? Retrieved from Displayr: https://www.displayr.com/what-is-hierarchical-clustering/

Bramer, M. (2017). Principles of data mining. London : Springer.

Chu, W. W., & Lin, T. Y. (2005). Foundations and advances in data mining. Berlin: New York : Springer.

Cordeiro, R. L., Faloutsos, C., & Júnior, C. T. (2013). Data Mining in Large Sets of Complex Data. London: Springer London.

Edureka. (2014, July 4). K means Clustering. Retrieved from Slideshare: https://www.slideshare.net/EdurekaIN/k-means-clustering

Gan, G. (2016). Data clustering in C++ : an object-oriented approach. New York.

Giusti, A., Ritter, G., & Vichi, M. (2014). Classification and data mining. Berlin.

Han, J., Kamber, M., & Pei, J. (2013). Data mining : concepts and techniques. Amsterdam: Amsterdam Press.

Hemlata Shau, s. s. (n.d). A brief overview on data mining. ITCTEE, 1-18.

Kantardzic, M. (2014). Data mining : concepts, models, methods, and algorithms. Chicago: IEEE Press.

King, R. S. (2015). Cluster analysis and data mining : an introduction. Virginia.

Klösgen, W. (2002). Handbook of data mining and knowledge discovery. Oxford: Oxford University Press.

Larose, D. T., & Larose, C. D. (2014). Discovering knowledge in data : an introduction to data mining. John Wiley & Sons.

Maheshwari, A. K. (2015). Business intelligence and data mining. Chicago: Business Expert Press.

Maimon, O., & Rokach, L. (2010). Data mining and knowledge discovery handbook. New York: Springer.

Maloof, M. A. (2006). Machine learning and data mining for computer… London: Springer.

Mirkin, B. (2005). Clustering for data mining : a data recovery approach. London: Boca Raton.

Olson, D. L. (2015). Descriptive data mining. Singapore: Springer Nature.

Perner, P. (2013). Advances in Data Mining. London: Springer.

Perner, P. (2014). Machine learning and data mining in pattern recognition : 10th International Conference, MLDM 2014, St. Petersburg, Russia, July 21-24, 2014. Proceedings. Springer.

Tan, H. (2012). Knowledge Discovery and Data Mining. Berlin: Springer Berlin Heidelberg.

Tan, P.-N., Steinbach, M., & Kumar, V. (2014). Introduction to data mining. Pearson.

Wu, J. (2014). Advances in k-means clustering : a data mining thinking. London: Springer.

Wu, X., & Kumar, V. (2009). The top ten algorithms in data mining. London: CRC Press.

Zanasi, A., Brebbia, C. A., & Ebecken, N. F. (2007). Data mining VIII : data, text and web mining and their business applications. Chicago: WIT press.
