Using K-means For Customer Segmentation In E-commerce

Research Objective

Research Paper

Use of the K-means Algorithm in Customer Segmentation

Introduction

In data mining, clustering is considered a critical step. Statistically, it is a multivariate procedure suitable for many kinds of segmentation of a given data set. This research paper applies the k-means clustering technique, using the Python programming language, to a selected e-commerce data set.

For a business, market or customer segmentation is the process of dividing its customer base into multiple homogeneous groups of users who share similar buyer characteristics. These attributes include buying habits, interest in products, product preferences and so on (Al-Wakeel & Wu, 2016). Customer segmentation is considered the most fundamental step in strategic planning for acquiring potential customers as well as retaining existing ones. Buyers are mainly segmented into categories according to their purchasing capability and interests.

With the present rate of growth in e-commerce, data clustering promises effective answers to many of the issues that arise as customers interact with websites and buy products from them, generating an expanding volume of data (Kwac, Flora and Rajagopal, 2014). By clustering we mean the unsupervised procedure through which a large amount of data is segmented into homogeneous, disjoint groups. This segmentation depends on the similarity between attributes. The K-means algorithm is also used in other applications, such as decision making and pattern classification.

This paper discusses the processing of the data set and its analysis using the k-means clustering algorithm. In addition, it discusses the findings from the analysis of the dataset and

Related work: customer segmentation using the K-means algorithm

In their paper, Kwac, Flora and Rajagopal (2014) stated that clustering is effective at finding subtle yet strategic connections hidden inside unlabelled datasets. This type of analysis falls under unsupervised learning. There are many clustering algorithms, including the Self-Organizing Map (SOM), k-Means clustering, k-Nearest Neighbour clustering and so on (Al-Wakeel & Wu, 2016). These techniques are useful because, with no prior information about the dataset, they can recognise clusters by repeated comparison of the input patterns until stable clusters are achieved.

Each cluster contains data points that are similar to one another yet differ significantly from the data points of other clusters. Clustering has extensive applications in image analysis, pattern recognition and so on. In their paper they tried to find segments using the k-Means algorithm implemented in MATLAB code.

The authors describe the implementation of the K-means algorithm as follows. K-means clustering, a centroid-based technique, takes k as the input parameter and then divides the set of data points into k groups. The aim of clustering is to make the similarity within each group as high as possible and the similarity between groups as low as possible. The similarity of clusters can be measured by the mean value of the objects in a group, which can be regarded as the centroid of the group.

Methodology

On the other hand, researchers such as Dhanachandra, Manglem & Chanu (2015) found that k-means is a non-hierarchical, partitional clustering technique suitable for classifying large amounts of data into multiple patterns or clusters. It is the simplest and most widely used such method in data analysis, and it uses the squared-error criterion to determine the clusters.

For a given data set of n numeric items and an integer k, it computes a partition of the examples into k clusters. The procedure is iterative: it starts from an arbitrary partition and continues until it reaches a partition of the n items that minimises the within-cluster sum of squared errors (Dhanachandra, Manglem & Chanu, 2015). K-means is computed in four stages:

Initialise the k cluster centres to coincide with k arbitrarily chosen examples, or with k randomly defined points inside the region that contains the example set.
Assign each pattern to the nearest cluster centre (group mean).
Re-evaluate the cluster centres using the current cluster members.
Check the clusters for convergence; if they have not converged, repeat from the second stage.
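
The four stages can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in this paper; the function name kmeans, the seed and the six two-feature points are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Stage 1: pick k distinct examples as the initial cluster centres.
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Stage 2: assign every pattern to its nearest centre.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Stage 3: recompute each centre as the mean of its members.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centre where it was.
                new_centres.append(centres[c])
        # Stage 4: stop once the centres no longer move.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels


# Hypothetical two-feature data with two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centres, labels = kmeans(data, k=2)
```

On real data the result can depend on the initial centres, so in practice the procedure is usually run several times from different starting points, or a library implementation is used instead.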

Analysis of dataset

Description of data

We selected a data set that describes the customers of an online distributor. It records annual spending, in monetary units, on different types of products. It consists of 8 columns and 348 rows of data that are used for the analysis.

Justification for selecting K-means algorithm over the DBSCAN

To segment the customers in the selected dataset we chose the K-means algorithm. The primary reason for this choice is that clustering helps to create groups based on continuous variables, which are the kind typically used (Al-Wakeel & Wu, 2016). As this project tries to create customer groups from several attributes, clustering can be very helpful in finding the boundaries between groups in the selected dataset.

In this project we work with multiple variables of interest from the data set. These variables are treated as the input variables in the analysis. The resulting clusters can then be interpreted in light of the selected variables (Maldonado, Carrizosa & Weber, 2015).

K-means clustering is also very useful in exploratory analysis. It helps to build a picture of typical customer characteristics in the selected dataset.

In addition, homogeneity is an important factor in cluster analysis. With K-means clustering, the variance within each resulting group is found to be very small. By contrast, with rule-based segmentation the resulting groups can contain customers whose attributes are actually very different from each other.

It also provides dynamic clustering results, as the cluster definitions are updated with every iteration or every time the algorithm runs. For real-time data clustering, this ensures that the resulting groups always reflect the current state of the data being analysed.

On the other hand, DBSCAN and other density-based clustering algorithms discover clusters as areas of high density separated by areas of low density, which may make the explicit determination of buyer clusters confusing (Dhanachandra, Manglem & Chanu, 2015). The density-based spatial clustering of applications with noise (DBSCAN) algorithm classifies every available point as a core point, a border point or a noise point.

Data Preprocessing

Core points are those that have at least MinPts points within the distance ε. Border points are points that are not core points but are neighbours of core points. Noise points are those that are neither core points nor border points.
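
The three point types can be illustrated with a short standard-library sketch. It only labels the points and does not build the clusters themselves; the function name classify_points, the parameter names eps and min_pts, and the sample coordinates are assumptions made for the example. Counting a point as its own neighbour is one common convention.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' as in DBSCAN."""
    # Neighbourhood of a point: indices of all points within eps of it
    # (the point itself is counted, which is one common convention).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, near in enumerate(neighbours) if len(near) >= min_pts}
    kinds = []
    for i, near in enumerate(neighbours):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in near):
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Hypothetical 2-D points: a dense group, one point on its edge, one outlier.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (6.0, 6.0)]
kinds = classify_points(pts, eps=1.0, min_pts=3)
# kinds -> ['core', 'core', 'core', 'core', 'border', 'noise']
```

The outcome depends strongly on the two parameters, which is part of why the boundaries between buyer groups can be harder to pin down with this approach.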

Pre-processing on the dataset

For the analysis, we selected the data available at archive.ics.uci.edu under the title “Wholesale Customer data”.

The data set includes 440 rows; for this assignment we reduced it to 348 rows. Before analysing the dataset, we pre-processed the data for better results. The summary statistics for the data set are as follows:

	Channel	Region	Fresh	Milk	Grocery	Frozen	DetergentsPaper	Delicatessen
count	348.000000	348.000000	348.000000	348.000000	348.000000	348.000000	348.000000	348.000000
mean	1.324713	2.422414	12027.491379	5583.442529	7762.818966	3091.054598	2814.166667	1576.672414
std	0.468942	0.829751	13143.047477	7131.588022	9206.970951	5100.776773	4654.992536	3109.744474
min	1.000000	1.000000	3.000000	55.000000	3.000000	33.000000	3.000000	3.000000
25%	1.000000	2.000000	2916.000000	1471.500000	2141.250000	779.000000	261.500000	408.000000
50%	1.000000	3.000000	8305.000000	3539.500000	4725.000000	1456.500000	771.000000	900.500000
75%	2.000000	3.000000	16850.500000	7190.250000	10550.000000	3505.250000	3971.500000	1795.750000
max	2.000000	3.000000	112151.000000	73498.000000	92780.000000	60869.000000	40827.000000	47943.000000

From the above table it can be seen that there are 348 rows in total, and that each row records the monetary value spent on different products such as fresh, milk, grocery, frozen, detergents_paper and delicatessen.

For the products, we have the following statistical data:

Product	mean	std	min	max
Fresh	12027.491379	13143.047477	3	112151
Milk	5583.442529	7131.588022	55	73498
Grocery	7762.818966	9206.970951	3	92780
Frozen	3091.054598	5100.776773	33	60869
DetergentsPaper	2814.166667	4654.992536	3	40827
Delicatessen	1576.672414	3109.744474	3	47943
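
A summary of this kind can be computed per column with Python's statistics module. The sketch below is illustrative only: the function name describe and the five milk values are assumptions, and the quartiles use the "inclusive" method, which interpolates linearly between data points.

```python
import statistics


def describe(values):
    """Return count, mean, std, min, quartiles and max for one column."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical annual spending on one product category (monetary units).
milk = [1200, 3500, 560, 9800, 4300]
summary = describe(milk)
```

A mean well above the median, as in the product figures above, is a quick sign that a column is right-skewed.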

At this stage we explored some arbitrarily selected customers in more detail by subtracting the mean and median values from their purchases, which gives something like the following:

	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicatessen
0	4198.0	-3758.0	-5998.0	-2238.0	-2644.0	-510.0
1	-11455.0	4180.0	14419.0	-870.0	2068.0	986.0
2	10294.0	-2367.0	-6316.0	-883.0	-2636.0	1025.0

From the above table we found that the first selected customer (index 0) buys more than average in Fresh products. The second customer (index 1) buys more than average in Milk, Grocery, Detergents_Paper and Delicatessen. Finally, the third customer (index 2) buys more Fresh and more Delicatessen than the other two customers.
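
The comparison of a customer against a reference value can be sketched as follows. The spending records, the feature subset and the function name deviation are hypothetical and do not reproduce the figures reported above; the mean is used as the reference, and the median could be substituted in the same way.

```python
import statistics

# Hypothetical spending records; each row is one customer, each key a feature.
customers = [
    {"Fresh": 500, "Milk": 9000, "Grocery": 12000},
    {"Fresh": 15000, "Milk": 1200, "Grocery": 1500},
    {"Fresh": 7000, "Milk": 3000, "Grocery": 4500},
    {"Fresh": 9500, "Milk": 2800, "Grocery": 6000},
]
features = ["Fresh", "Milk", "Grocery"]

# Reference value per feature: here the mean (the median works the same way).
reference = {f: statistics.mean(c[f] for c in customers) for f in features}


def deviation(customer):
    """Spending of one customer relative to the reference, per feature."""
    return {f: customer[f] - reference[f] for f in features}


sample_offsets = [deviation(customers[i]) for i in (0, 1)]
```

A positive entry means the customer spends more than the reference on that product, and a negative entry means less.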

In the next stage, to find the relations between the features of the data set, we examined the distribution of each feature in the data set, which is depicted below.

From the above scatter plot it is evident that the Grocery and Detergents_Paper features have the highest correlation in the selected dataset. As the distributions are mostly right-skewed, it can be said from the data set that high spending on frozen items is not paired with a higher rate of fresh food purchases.
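
The strength of such a relation can be quantified with the Pearson correlation coefficient. The sketch below is a plain implementation for illustration; the function name pearson and the two spending lists are hypothetical and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical spending of five customers on two product categories.
grocery = [2000, 4500, 7000, 9500, 21000]
detergents_paper = [300, 900, 2500, 4000, 9800]
r = pearson(grocery, detergents_paper)
```

A value close to 1 indicates that the two features rise together, while a value close to 0 indicates no linear relation.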

To reduce the skewness of the dataset, we applied non-linear scaling to the features using the natural logarithm.
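
The effect of the logarithmic scaling can be seen in a small sketch. The spending values are hypothetical, chosen only to span a range similar to the one in the summary statistics; all values must be strictly positive for the logarithm to be defined.

```python
import math

# Hypothetical right-skewed spending values for one feature.
spending = [3, 120, 950, 4800, 112151]

# Natural-log scaling; all values must be strictly positive.
log_spending = [math.log(v) for v in spending]

# The spread between the largest and smallest value shrinks sharply.
raw_ratio = max(spending) / min(spending)
log_ratio = max(log_spending) / min(log_spending)
```

The transform preserves the ordering of the customers while pulling in the long right tail, so that a few very large purchases no longer dominate the distance calculations.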

Analysis of the Result

One of the metrics commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to the data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points (Dhanachandra, Manglem & Chanu, 2015). For this reason, the metric cannot be used as the sole target.
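
This behaviour can be checked on a toy example. In the sketch below the function name mean_distance, the four points and the hand-assigned labels for K = 1, K = 2 and K = n are all assumptions made for illustration; no clustering algorithm is run.

```python
import math


def mean_distance(points, labels):
    """Mean distance between each point and the centroid of its cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {
        lab: tuple(sum(axis) / len(members) for axis in zip(*members))
        for lab, members in clusters.items()
    }
    total = sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return total / len(points)


# Hypothetical points with hand-assigned labels for K = 1, 2 and n.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
d_k1 = mean_distance(pts, [0, 0, 0, 0])   # one cluster
d_k2 = mean_distance(pts, [0, 0, 1, 1])   # two clusters
d_kn = mean_distance(pts, [0, 1, 2, 3])   # every point is its own cluster
```

Here the metric falls from about 5.1 to 1.0 and then to 0, which is why it is usually read together with another criterion, for example the point at which adding further clusters stops giving a large improvement.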

From the processed data we got the following clusters marked with the black outline.

From the above cluster analysis it can be stated that each cluster depicted in the figure has a central point. These centres (or means) are not specifically data points from the data set, but rather the averages of all the data points predicted to be in the respective clusters. For the problem of creating customer segments, a cluster's central point corresponds to the average customer of that cluster (Maldonado, Carrizosa & Weber, 2015). Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.
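
The logarithmic part of the inverse transformation can be sketched as follows. The centre values are hypothetical, and the sketch assumes the centre has already been mapped back from the reduced dimensions to the original feature space, a step that is not shown.

```python
import math

# Hypothetical cluster centre expressed in natural-log spending units.
log_centre = {"Fresh": 9.1, "Milk": 7.6, "Grocery": 8.0}

# exp() undoes the log scaling and returns the centre to monetary units.
# Undoing a dimensionality reduction would be a separate, earlier step
# and is not shown here.
typical_customer = {f: math.exp(v) for f, v in log_centre.items()}
```

The recovered values can then be compared directly with the summary statistics of the original spending data.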

It can additionally be combined with other decision strategies, which can be built effectively on top of it, allowing the customer to redefine his criteria and preferences in light of the clusters computed. Moreover, the customer need not reveal his buying strategy but can instead use the clustering produced to shape it (Maldonado, Carrizosa & Weber, 2015).

The customer can combine data items from different sources, filter them using the range search and classify them, and is thus able to integrate product catalogues from various suppliers.
The online store need not keep information about the customer apart from his preference rectangle and final purchase decisions, thereby enhancing ethical factors such as anonymous purchasing, and capturing only changes in customer preferences per session.
In the case of mobile computing, the algorithm described in this paper can be updated by the online shop server cluster, and the customer can retrieve the filtered information on his mobile phone through incremental steps, or apply decision-making software to shape his personalised purchasing criteria and preferences.

Possible future work

For this project we used a small dataset to implement cluster analysis with the k-means algorithm. In future, this work may be extended by applying the same algorithm to a larger dataset to improve the effectiveness of the approach.

Conclusion

The aim of customer segmentation is to anticipate customers' requirements accurately so that organizations can retain their customers. Consequently, organizations can improve the productivity of the business and profit from it by obtaining or manufacturing items in the right quantity, at the right time, for loyal customers at an optimum cost.

To meet these stringent requirements, the k-means clustering technique can be very helpful for forecasting the business appropriately and for determining future business strategies. Through this process it is possible to classify brands, items, durability, utility, convenience and so on. For instance, it can be determined which brands are grouped together when customer buying patterns include certain brands at once.
