Research Paper
Use of the K-means Algorithm in Customer Segmentation
Introduction
In data mining, clustering is considered a critical step. Statistically, it is a multivariate procedure suitable for many kinds of segmentation tasks on a given data set. This research paper applies the k-means clustering technique, using the Python programming language, to a selected e-commerce data set.
For a business, market or customer segmentation is the process of dividing its customer base into multiple homogeneous groups of users who share similar buyer characteristics. These attributes include buying habits, interest in products, product preferences and so on (Al-Wakeel & Wu, 2016). Customer segmentation is considered the most fundamental step in strategic planning, both for acquiring potential customers and for retaining existing ones. Buyers are mainly segmented into categories according to their purchasing capability and interests.
With the present rate of growth in e-commerce, data clustering promises effective answers to many of the issues that arise from the expanding volume of data generated when customers interact with websites and buy products from them (Kwac, Flora & Rajagopal, 2014). By clustering we mean the unsupervised procedure through which large amounts of data are segmented into homogeneous and disjoint groups. The segmentation depends on the similarity between attributes. The k-means algorithm is also used in other applications, such as decision making and pattern classification.
This paper discusses the pre-processing of the data set and its analysis using the k-means clustering algorithm. In addition, it discusses the findings from that analysis.
Related work: customer segmentation using the K-means algorithm
Kwac, Flora and Rajagopal (2014) state that clustering is effective at finding subtle yet strategic connections hidden inside unlabelled datasets. This type of analysis falls under unsupervised learning. There are many clustering algorithms, including the Self-Organizing Map (SOM), k-means clustering, k-Nearest Neighbour clustering and so on (Al-Wakeel & Wu, 2016). These techniques are useful because, without any prior information about the dataset, they can recognise clusters through repeated comparisons of the input patterns until stable clusters are achieved.
Each cluster contains data points that are similar to one another but differ significantly from the data points of other clusters. Clustering has extensive applications in image analysis, pattern recognition and so on. In their paper, the authors tried to identify segments using the k-means algorithm implemented in MATLAB.
The authors describe the implementation of the k-means algorithm as follows. K-means clustering, a centroid-based technique, takes k as its input parameter and then separates the set of data points into k groups. The purpose of clustering is to make the similarity within each group as high as possible and the similarity between groups as low as possible. The similarity of a cluster can be measured by the mean value of the objects in the group, which can be regarded as the centroid of the group.
Dhanachandra, Manglem & Chanu (2015), on the other hand, describe k-means as a non-hierarchical, partitional clustering strategy suited to classifying large amounts of data into multiple patterns or clusters. It is the simplest and most widely used technique for data analysis, and it uses the squared-error criterion to determine the clusters.
For a given data set consisting of n numeric items and an integer k, it computes a partition of the items into k clusters. The procedure is iterative: it begins from an arbitrary partition and continues until it finds a partition of the n items that minimises the within-cluster sum of squared errors (Dhanachandra, Manglem & Chanu, 2015). K-means is computed in four stages:
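A common four-stage formulation of k-means is: choose k initial centroids, assign every item to its nearest centroid, recompute each centroid as the mean of its items, and repeat the last two stages until the assignment stops changing. The sketch below follows that formulation in plain Python. It is an illustrative assumption, not the code used for this paper, and the names `kmeans` and `sse` as well as the sample points are hypothetical.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Stage 1: pick k data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Stage 2: assign each point to its nearest centroid
        # (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Stage 4: stop once no assignment changes between iterations.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Stage 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


def sse(points, centroids, assignments):
    """Within-cluster sum of squared errors, the criterion k-means reduces."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
        for p, c in zip(points, assignments)
    )


# Hypothetical two-feature points forming two well-separated groups.
sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centres, labels = kmeans(sample, k=2)
print(labels, sse(sample, centres, labels))
```

The result depends on the initial centroids, so in practice the procedure is usually run several times with different seeds and the partition with the lowest squared error is kept.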
Analysis of dataset
Description of data
We selected a data set that refers to the customers of an online distributor. It records annual spending, in monetary units, on different types of products. It consists of 8 columns and about 350 rows of data that are used for the analysis.
Justification for selecting the K-means algorithm over DBSCAN
To segment the customers in the selected dataset we chose the k-means algorithm. The primary reason is that clustering creates groups from continuous variables (Al-Wakeel & Wu, 2016). Because this project tries to form customer groups from several attributes at once, clustering helps to find the boundaries between the groups in the selected dataset.
This project works with multiple variables of interest from the data set, all of which are treated as input variables in the analysis. The resulting clusters can then be interpreted in light of the selected variables (Maldonado, Carrizosa & Weber, 2015).
K-means clustering is also useful in exploratory analysis, where it helps to build a picture of typical customer characteristics from the selected dataset.
Homogeneity is another important factor in cluster analysis. With k-means clustering, the variance within each resulting group is found to be very small. By contrast, rule-based segmentation can produce groups whose customers differ considerably from one another in their attributes.
K-means also provides dynamic clustering results, because the cluster definitions change with every iteration or each time the algorithm runs. For real-time data, this ensures that the resulting groups always reflect the current state of the data being analysed.
On the other hand, DBSCAN and other density-based clustering algorithms discover clusters as high-density areas separated by low-density areas, which may make it harder to determine explicit clusters of buyers (Dhanachandra, Manglem & Chanu, 2015). The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm classifies every point as a core point, a border point or a noise point.
Core points are those that have at least MinPts points within the distance ε. Border points are points that are not core points but are neighbours of core points. Noise points are those that are neither core points nor border points.
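The three DBSCAN point types can be illustrated with a short sketch. It only labels points and does not build the clusters themselves. The function name `classify_points`, the convention that a point counts as its own neighbour, and the sample coordinates are assumptions made for illustration.

```python
def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' in the DBSCAN sense."""

    def neighbours(i):
        # Indices of all points within eps of point i (the point itself included).
        return [
            j
            for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2
        ]

    neighbourhoods = [neighbours(i) for i in range(len(points))]
    # A core point has at least min_pts points in its eps-neighbourhood.
    core = {i for i, nb in enumerate(neighbourhoods) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighbourhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical points: a dense square, one point on its edge, one far outlier.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0), (10.0, 10.0)]
print(classify_points(sample, eps=1.5, min_pts=4))
```

Unlike k-means, this labelling needs two parameters (ε and MinPts) instead of the number of clusters, and it can leave some customers unassigned as noise.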
Pre-processing on the dataset
The selected data set is available at archive.ics.uci.edu under the title “Wholesale Customer data”.
The data set includes 440 rows; for this assignment we reduced it to 348 rows. Before analysing the dataset, we pre-processed the data for better results. The summary statistics of the data set are as follows:
|       | Channel    | Region     | Fresh         | Milk         | Grocery      | Frozen       | DetergentsPaper | Delicatessen |
| count | 348.000000 | 348.000000 | 348.000000    | 348.000000   | 348.000000   | 348.000000   | 348.000000      | 348.000000   |
| mean  | 1.324713   | 2.422414   | 12027.491379  | 5583.442529  | 7762.818966  | 3091.054598  | 2814.166667     | 1576.672414  |
| std   | 0.468942   | 0.829751   | 13143.047477  | 7131.588022  | 9206.970951  | 5100.776773  | 4654.992536     | 3109.744474  |
| min   | 1.000000   | 1.000000   | 3.000000      | 55.000000    | 3.000000     | 33.000000    | 3.000000        | 3.000000     |
| 25%   | 1.000000   | 2.000000   | 2916.000000   | 1471.500000  | 2141.250000  | 779.000000   | 261.500000      | 408.000000   |
| 50%   | 1.000000   | 3.000000   | 8305.000000   | 3539.500000  | 4725.000000  | 1456.500000  | 771.000000      | 900.500000   |
| 75%   | 2.000000   | 3.000000   | 16850.500000  | 7190.250000  | 10550.000000 | 3505.250000  | 3971.500000     | 1795.750000  |
| max   | 2.000000   | 3.000000   | 112151.000000 | 73498.000000 | 92780.000000 | 60869.000000 | 40827.000000    | 47943.000000 |
The above table shows that there are 348 rows in total, each recording the monetary value spent on different product categories: Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen.
For the products, the summary statistics are as follows:
Product (mean, std, min, max)
Fresh (12027.491379, 13143.047477, 3, 112151)
Milk (5583.442529, 7131.588022, 55, 73498)
Grocery (7762.818966, 9206.970951, 3, 92780)
Frozen (3091.054598, 5100.776773, 33, 60869)
DetergentsPaper (2814.166667, 4654.992536, 3, 40827)
Delicatessen (1576.672414, 3109.744474, 3, 47943)
At this stage we explored some arbitrarily selected customers by subtracting the mean and median values from their purchases, which gives results like the following:
   Fresh     Milk     Grocery  Frozen   Detergents_Paper  Delicatessen
0  4198.0    -3758.0  -5998.0  -2238.0  -2644.0           -510.0
1  -11455.0  4180.0   14419.0  -870.0   2068.0            986.0
2  10294.0   -2367.0  -6316.0  -883.0   -2636.0           1025.0
The above table shows that the first selected customer (row 0) buys more Fresh products than average. The second customer (row 1), on the other hand, buys more Milk, Grocery, Detergents_Paper and Delicatessen products than average. Finally, the third customer (row 2) buys more Fresh and Delicatessen products than the other two customers.
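The comparison against the average can be sketched in a few lines. The records below are hypothetical values rather than rows from the data set, and `deviations` is an illustrative helper name; positive results mean above-average spending in that category.

```python
from statistics import mean, median


def deviations(rows, columns, centre=mean):
    """Subtract a per-column centre (mean or median) from every row."""
    centres = {c: centre([r[c] for r in rows]) for c in columns}
    return [{c: r[c] - centres[c] for c in columns} for r in rows]


# Hypothetical spending records, not taken from the data set.
customers = [
    {"Fresh": 100.0, "Milk": 40.0},
    {"Fresh": 300.0, "Milk": 10.0},
    {"Fresh": 200.0, "Milk": 70.0},
]

print(deviations(customers, ["Fresh", "Milk"]))                 # relative to the mean
print(deviations(customers, ["Fresh", "Milk"], centre=median))  # relative to the median
```

For right-skewed spending data the median is often the more representative reference, because a few very large buyers pull the mean upwards.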
In the next stage, to find the relations between the features of the data set, we examined the distribution of each feature, which is depicted below:
From the above scatter plot it is evident that the Grocery and Detergents_Paper features have the highest correlation in the selected dataset. The distributions are mostly right-skewed, and from the data it appears that high spending on Frozen items is not paired with high spending on Fresh food.
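A visual impression of correlation can be checked numerically with the Pearson coefficient. The sketch below is a plain-Python version; the function name `pearson` and the sample values are assumptions for illustration and are not figures from the data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalised standard deviations; the 1/n factors cancel in the ratio.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical spending on two product categories for five customers.
grocery = [200.0, 450.0, 900.0, 1500.0, 3000.0]
detergents = [50.0, 120.0, 260.0, 400.0, 950.0]
print(pearson(grocery, detergents))
```

Values close to 1 indicate that the two categories rise together, values close to -1 that one falls as the other rises, and values near 0 that there is little linear relation.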
To reduce the skewness of the dataset, we applied a non-linear scaling to the features using the natural logarithm.
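A minimal sketch of this scaling is given below, assuming every value is strictly positive (the natural logarithm is undefined at zero). The helper name `log_scale` and the sample rows are hypothetical.

```python
import math


def log_scale(rows):
    """Apply the natural logarithm to every value; values must be > 0."""
    return [[math.log(v) for v in row] for row in rows]


# Hypothetical spending rows with one very large buyer in the first column.
spending = [[3.0, 55.0], [8305.0, 3539.5], [112151.0, 73498.0]]
for original, scaled in zip(spending, log_scale(spending)):
    print(original, [round(v, 2) for v in scaled])
```

The logarithm compresses large values far more than small ones, so a long right tail is pulled in towards the bulk of the data while the ordering of customers within each feature is preserved.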
Analysis of the Result
One of the metrics commonly used to compare results across different values of K is the mean distance between the data points and their cluster centroid. Since increasing the number of clusters always reduces the distance from the data points to their nearest centroid, increasing K always decreases this metric, to the extreme of reaching zero when K equals the number of data points (Dhanachandra, Manglem & Chanu, 2015). This metric therefore cannot be used as the sole target.
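The behaviour of this metric can be seen on a toy example. In the sketch below the partitions for K = 1, 2 and 4 are written out by hand rather than produced by k-means, and the points and the function name `mean_distance_to_centroid` are hypothetical.

```python
from math import dist  # available from Python 3.8


def mean_distance_to_centroid(points, labels):
    """Average Euclidean distance from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Each centroid is the column-wise mean of the cluster's members.
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in clusters.items()
    }
    total = sum(dist(p, centroids[label]) for p, label in zip(points, labels))
    return total / len(points)


# Hypothetical points and hand-written partitions for K = 1, 2 and 4.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 9.0)]
for labels in ([0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3]):
    print(len(set(labels)), mean_distance_to_centroid(points, labels))
```

The value drops sharply from K = 1 to K = 2 and reaches zero at K = 4, where every point is its own centroid. A common way to use the metric is therefore to look for the K beyond which the improvement becomes small, rather than to minimise it outright.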
From the processed data we obtained the following clusters, marked with a black outline.
From the above cluster analysis, it can be stated that each cluster depicted in the figure has a central point. These centres (or means) are not actual data points from the data set, but rather the averages of all the data points assigned to the respective clusters. For the problem of creating customer segments, a cluster's central point corresponds to the average customer of that segment (Maldonado, Carrizosa & Weber, 2015). Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these centres by applying the inverse transformations.
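The recovery of representative spending can be sketched for the logarithmic part of the transformation. The example assumes the centre is already expressed in log-scaled feature space, so undoing any dimensionality reduction is not shown; the helper names and sample values are hypothetical.

```python
import math


def cluster_centre(members):
    """Column-wise mean of the points assigned to one cluster."""
    n = len(members)
    return [sum(col) / n for col in zip(*members)]


def to_spending(log_centre):
    """Undo a natural-log scaling to express a centre in monetary units."""
    return [math.exp(v) for v in log_centre]


# Hypothetical log-scaled members of one cluster (two features each).
members = [
    [math.log(100.0), math.log(10.0)],
    [math.log(400.0), math.log(40.0)],
]
centre = cluster_centre(members)
print(to_spending(centre))  # roughly 200 and 20 monetary units
```

Because the mean is taken in log space, the recovered figure is the geometric mean of the members' spending, which matches neither member exactly. This illustrates why a centre describes an average customer rather than an actual one.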
The clustering can additionally be combined with other decision techniques built on top of it, allowing the customer to redefine his criteria and preferences in light of the computed clusters. Moreover, the customer need not reveal his buying strategy but can instead use the generated classification to shape it (Maldonado, Carrizosa & Weber, 2015).
Possible future work
For this project we used a small dataset to implement the cluster analysis with the k-means algorithm. In future, this work may be extended by applying the same algorithm to a larger dataset to improve the effectiveness of the developed approach.
Conclusion
The goal of customer segmentation is to anticipate customers' requirements accurately so that organisations can retain them. Organisations can thereby improve the productivity of the business, and profit from it, by purchasing or manufacturing items in the right quantity and at the right time for loyal customers at an optimum cost.
To meet these stringent requirements, k-means clustering can be very helpful for forecasting and for determining future business strategies. Through this process it is possible to classify brands, items, durability, utility, convenience and so on. For instance, it can be determined which brands are grouped together when customer buying patterns include several specific brands at once.