Market Segmentation And Customer Segmentation Using K-means Clustering On Online Retail Dataset

Problem statement

This report presents a detailed market segmentation analysis based on a secondary dataset. The dataset is a transnational record of transactions from a non-store online retailer based in the United Kingdom (UK). It contains all transactions from 01/12/2010 to 09/12/2011 and provides information on Invoice Number, Stock Code, Product Description, Quantity, Invoice Date, Unit Price, Customer ID, and Country. Market segmentation is a set of strategies that help to define the market and to allocate resources, supported here by a statistical technique, namely cluster analysis (Professional, 2018). More specifically, the non-hierarchical technique of K-means clustering is used to perform a customer segmentation in the statistical software “R”. Customer segmentation is the part of market segmentation in which the customers of a company are divided into groups by their similarities, so that the members are homogeneous within a group and heterogeneous across groups. This segmentation helps marketers to understand the preferences of the customers and increases customer value to the business (Optimove, 2018). Ultimately, it helps marketers to maximise the revenue from each customer (Optimove, 2018). The paper segments the customers on a psychographic basis, using the “Monetary value” of the products they purchased (Moon, 2018). This Monetary value is the product of the “Unit Price” and “Quantity” variables. The segmentation helps to understand the preferences, purchasing power, and lifestyle of the customers.

The chosen topic of this paper is customer segmentation analysis. For this purpose, a secondary transaction dataset was obtained from the internet. The dataset is multivariate and sequential in nature. The “Monetary value” of the products purchased by each customer is calculated as the product of the “Quantity” and “Unit Price” variables. The customers are then clustered by “Monetary value”, so that each cluster holds customers whose purchases have a similar monetary value. In other words, the aim is to segment the customers into clusters that each represent certain characteristics. In principle, the customers could be divided into any number of such clusters. However, the optimal number depends on the objective of the business.

The rationale for this analysis is to help marketers identify customers in the most effective manner (Optimove, 2018). It also helps marketers to study distinct groups of customers with greater accuracy on the basis of psychographic factors. The segmentation is performed with an unsupervised clustering technique. K-means clustering is preferred in this study because the total number of observations is more than 100.

The study is based on real transaction data from a UK-based, non-store online retailer, most of whose customers are wholesalers. The data covers the transactions that occurred from 01/12/2010 to 09/12/2011.

Significance and originality of the study

The dataset contains behavioral data based on the product transactions of a non-store online retailer (Archive.ics.uci.edu, 2018). There are eight variables in the dataset:

Invoice No.
Stock Code
Description
Quantity
Invoice Date
Unit Price
Customer ID
Country

The variables used for the calculation and coding are Customer ID, Quantity, and Unit Price.

After collection, the dataset is prepared for the necessary calculations. The values of “Quantity” and “Unit Price” are multiplied to obtain the “Monetary Value”.

K-means clustering is a non-hierarchical clustering technique that looks for the grouping of objects that optimizes a criterion of interest, typically by minimizing the distances between the objects and the centroid of their cluster. The following steps are used to perform K-means clustering (Mb3is.megx.net, 2018).

Customers are randomly partitioned into k non-empty clusters.
The centroid of every cluster is calculated.
Each customer is then assigned to the cluster whose centroid is nearest.
The previous two steps are repeated for as long as any assignment changes.
The procedure finishes when there is no change in the assignment (Mb3is.megx.net, 2018).
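The steps listed above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the code used in the study: simple_kmeans() is a hypothetical helper, the toy data is simulated, and distance is taken to be squared Euclidean distance.

```r
# Hypothetical helper that mirrors the listed steps; it is not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 2, n >= k)
  # Step 1: random partition into k non-empty clusters
  cluster <- sample(rep_len(seq_len(k), n))
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: centroid of every cluster (a cluster that empties keeps its old centroid)
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
    # Step 3: squared Euclidean distance of every point to every centroid,
    # then assignment of each point to its nearest centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Steps 4 and 5: stop as soon as no assignment changes
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Simulated toy data: two groups of 20 points in two dimensions
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)  # cluster sizes
```

In practice the built-in kmeans() function would normally be used instead. It starts from randomly chosen centers rather than from a random partition, but the alternation between the centroid step and the assignment step is the same idea.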

K-means clustering is most effective when the number of observations is large, because it computes quickly for small values of “k”.

The method requires the number of clusters to be specified in advance. Several candidate numbers of clusters are therefore checked first, and the optimal number is then selected (Mb3is.megx.net, 2018).

The entire calculation, evaluation, and graphical representation are done in the statistical software “R” (version 3.5.0).

The following table (Table 1) shows the list of libraries used to perform this analysis on this open-source software.

Table 1

library(factoextra)
library(fpc)
library(NbClust)
library(clValid)
library(magrittr)
library(clustertend)
library(cluster)
library(plyr)
library(XLConnect)

This study is based on market segmentation, of which customer segmentation is a part. Thus, the customer segmentation of an online retail store is considered. An important part of the analysis is to identify the optimal number of clusters, for which a set of metrics is used. The optimal number of clusters is found to be three. Before this number can be identified, the collected data must be cleaned.

No demographic or psychographic information about the customers is present in the dataset. Had these factors been present, the segmentation would have been based on them. In their absence, the segmentation is done so as to separate the higher-value customers from the lower-value ones. This will be helpful to the retail store for marketing purposes.

Thus, it was decided that the study would use three metrics derived from the dataset: the recency of the last purchase, the frequency of purchase, and the monetary value of the purchases. These three metrics are known as the RFM metrics and are among the most important metrics for customer segmentation in marketing. After these metrics are evaluated from the dataset, the segmentation is done with the k-means clustering technique, which handles large datasets well and produces solutions quickly.

K-means clustering

The steps followed to fit the k-means clustering model are stated in this section. The dataset is very large and contains information about a UK-based online retailer that sells to customers in various countries around the globe. The dataset also contains some missing values, which are eliminated at the beginning of the study. The R code used to remove the missing data from the dataset is given as follows:

Table 2: R-Codes to remove Missing Data

length(unique(data$CustomerID)) # number of unique customer IDs in the data
sum(is.na(data$CustomerID)) # number of rows with a missing customer ID
data <- subset(data, !is.na(data$CustomerID)) # keep only rows with a known customer ID

The dataset is quite large, so for simplicity the analysis is restricted to a recent subset. Only transactions dated from 9 December 2010 onwards, that is, the most recent year of data, are considered. Further, customer preferences vary with location, so the data is restricted to one geographic unit. Since the retailer is mainly UK based, the location is restricted to the United Kingdom. The R code used to extract the data is presented in the following table:

Table 3: R-Codes for Data Extraction

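range(data$InvoiceDate) # date range before filtering
data <- subset(data, InvoiceDate >= "2010-12-09") # keep transactions from 9 December 2010 onwards
range(data$InvoiceDate) # date range after filtering
table(data$Country) # number of rows per country
data <- subset(data, Country == "United Kingdom") # keep United Kingdom transactions only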

As already discussed, the recency and frequency variables are not present in the dataset and need to be derived from it. In a retail store, customers purchase items and also return items that do not satisfy them. To calculate the recency and frequency of purchases, it is therefore important to distinguish the purchase invoices from the return invoices. The R code for this is given in the following table:

Table 4: R-Codes for distinguishing between the purchase invoices and the return invoices

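data$item.return <- grepl("C", data$InvoiceNo, fixed=TRUE) # TRUE when the invoice number contains "C", which marks a return
data$purchase.invoice <- ifelse(data$item.return, 0, 1) # 1 for purchase invoices, 0 for return invoices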
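The effect of this flag can be checked on a few invoice numbers. The sketch below uses invented invoice numbers and assumes, as the code above does, that a return invoice is marked by the letter "C".

```r
# Hypothetical invoice numbers; the second one is a return
invoice_no <- c("536365", "C536379", "536381")
is_return <- grepl("C", invoice_no, fixed = TRUE)  # TRUE for return invoices
purchase_invoice <- ifelse(is_return, 0, 1)        # 1 = purchase, 0 = return
data.frame(invoice_no, is_return, purchase_invoice)
```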

The recency variable is the number of days that have passed since the customer last made a purchase. The smaller its value, the more recently the customer purchased a product. The frequency variable is the number of purchases from the store during the year, and the monetary value is the amount spent by the customer during the year. A negative monetary value indicates that the customer returned items purchased in the previous year. Such cases are recoded as zero. The necessary code is provided in the following table:

Table 5: Evaluation of RFM Variables

## ---------- Creation of Customer-Level Dataset ---------- ##
customers <- as.data.frame(unique(data$CustomerID))
names(customers) <- "CustomerID"
## ---------- Evaluation of Recency Variable ---------- ##
data$recency <- as.Date("2011-12-10") - as.Date(data$InvoiceDate) # days between each transaction and 10 December 2011
# remove returns so that only the most recent *purchase* is considered
temp <- subset(data, purchase.invoice == 1)
# Obtain # of days since most recent purchase
recency <- aggregate(recency ~ CustomerID, data=temp, FUN=min, na.rm=TRUE)
remove(temp)
# Add recency to customer data
customers <- merge(customers, recency, by="CustomerID", all=TRUE, sort=TRUE)
remove(recency)
customers$recency <- as.numeric(customers$recency)

## ---------- Evaluation of Frequency Variable ---------- ##
customer.invoices <- subset(data, select = c("CustomerID", "InvoiceNo", "purchase.invoice"))
customer.invoices <- customer.invoices[!duplicated(customer.invoices), ] # one row per invoice
customer.invoices <- customer.invoices[order(customer.invoices$CustomerID),]
row.names(customer.invoices) <- NULL
# Number of invoices/year (purchases only)
annual.invoices <- aggregate(purchase.invoice ~ CustomerID, data=customer.invoices, FUN=sum, na.rm=TRUE)
names(annual.invoices)[names(annual.invoices)=="purchase.invoice"] <- "frequency"
# Add # of invoices to customers data
customers <- merge(customers, annual.invoices, by="CustomerID", all=TRUE, sort=TRUE)
remove(customer.invoices, annual.invoices)
range(customers$frequency)
table(customers$frequency)
# Remove customers who have not made any purchases in the past year
customers <- subset(customers, frequency > 0)

## ---------- Evaluation of Customer's Monetary Value ---------- ##
# Total spent on each item on an invoice
data$Amount <- data$Quantity * data$UnitPrice
# Aggregated total sales to customer
annual.sales <- aggregate(Amount ~ CustomerID, data=data, FUN=sum, na.rm=TRUE)
names(annual.sales)[names(annual.sales)=="Amount"] <- "monetary"
# Add monetary value to customers dataset
customers <- merge(customers, annual.sales, by="CustomerID", all.x=TRUE, sort=TRUE)
remove(annual.sales)
# Identify customers with negative monetary value numbers, as they were presumably returning purchases from the preceding year
hist(customers$monetary)
customers$monetary <- ifelse(customers$monetary < 0, 0, customers$monetary) # reset negative numbers to zero
hist(customers$monetary)
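To make the three definitions concrete, the RFM values can be worked out by hand on a tiny example. The sketch below is only an illustration: the transactions are invented, the reference date of 10 December 2011 is taken from the code above, and tapply() is used in place of aggregate() and merge() to keep the example short.

```r
# Hypothetical toy transactions; the values are invented for illustration
tx <- data.frame(
  CustomerID  = c(1, 1, 2, 2, 3),
  InvoiceNo   = c("1001", "1002", "1003", "C1004", "1005"),
  InvoiceDate = as.Date(c("2011-12-01", "2011-12-08", "2011-11-10",
                          "2011-11-20", "2011-06-15")),
  Quantity    = c(2, 1, 5, -8, 10),
  UnitPrice   = c(3, 10, 2, 2, 1.5),
  stringsAsFactors = FALSE
)
ref_date <- as.Date("2011-12-10")  # same reference date as in Table 5

tx$is_return <- grepl("C", tx$InvoiceNo, fixed = TRUE)
tx$days_ago  <- as.numeric(ref_date - tx$InvoiceDate)
tx$amount    <- tx$Quantity * tx$UnitPrice
purchases    <- tx[!tx$is_return, ]  # returns do not count towards recency or frequency

recency   <- tapply(purchases$days_ago, purchases$CustomerID, min)
frequency <- tapply(purchases$InvoiceNo, purchases$CustomerID,
                    function(v) length(unique(v)))
monetary  <- pmax(tapply(tx$amount, tx$CustomerID, sum), 0)  # negatives reset to zero

rfm <- data.frame(CustomerID = as.numeric(names(recency)),
                  recency    = as.vector(recency),
                  frequency  = as.vector(frequency),
                  monetary   = as.vector(monetary))
print(rfm)
```

Customer 2 shows why the recoding is needed: the return outweighs the purchase, so the raw total is negative and is reset to zero.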

According to the Pareto Principle, roughly 80 percent of the results come from 20 percent of the causes. Applied to the current study, the principle suggests that the top 20 percent of the customers of the retail store account for 80 percent of the purchases. These customers are the high-value customers, who are very important for the business and whom the business would want to retain. The R code for this 80-20 segmentation is given in the following table:

Table 6: R Codes for 80-20 Segmentation

customers <- customers[order(-customers$monetary),] # highest spenders first
# Application of the Pareto Principle (80/20 Rule)
pareto.cutoff <- 0.8 * sum(customers$monetary) # 80 percent of total revenue
customers$pareto <- ifelse(cumsum(customers$monetary) <= pareto.cutoff, "Top 20%", "Bottom 80%")
customers$pareto <- factor(customers$pareto, levels=c("Top 20%", "Bottom 80%"), ordered=TRUE)
levels(customers$pareto)
round(prop.table(table(customers$pareto)), 2) # share of customers in each group
remove(pareto.cutoff)
customers <- customers[order(customers$CustomerID),] # restore the original ordering
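The cutoff logic can be followed on a small example. The sketch below uses invented spending figures and hypothetical variable names. It flags the customers whose cumulative spend stays within 80 percent of the total and then reports what share of the customers that is.

```r
# Hypothetical annual spend per customer (invented values)
monetary <- c(40, 500, 20, 250, 100, 30, 60)
ord <- order(-monetary)                  # highest spenders first
cutoff <- 0.8 * sum(monetary)            # 80 percent of total revenue
top <- cumsum(monetary[ord]) <= cutoff   # customers inside the cutoff
share_of_customers <- mean(top)          # fraction of customers flagged as high value
share_of_revenue <- sum(monetary[ord][top]) / sum(monetary)
c(customers = share_of_customers, revenue = share_of_revenue)
```

Because the comparison is "<=", the customer whose purchases cross the 80 percent line is left out, so the flagged group accounts for at most 80 percent of revenue. The share of customers in that group is whatever the data gives, which is why the labels "Top 20%" and "Bottom 80%" are nominal.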

The next step is to preprocess the data so that it is ready for the k-means clustering model. K-means requires continuous variables and works best when they are standardized and not heavily skewed. To meet these criteria, the input variables are log-transformed to reduce the skewness and then standardized. The code required for this transformation is given in the following table:

Table 7: R codes for log transformation and Standardization to normal Variables

# Log-transform positively-skewed variables
customers$recency.log <- log(customers$recency)
customers$frequency.log <- log(customers$frequency)
customers$monetary.log <- customers$monetary + 0.1 # can't take log(0), so add a small value to remove zeros
customers$monetary.log <- log(customers$monetary.log)
# Z-scores
customers$recency.z <- scale(customers$recency.log, center=TRUE, scale=TRUE)
customers$frequency.z <- scale(customers$frequency.log, center=TRUE, scale=TRUE)
customers$monetary.z <- scale(customers$monetary.log, center=TRUE, scale=TRUE)
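As a quick check of what the transformation does, the sketch below applies the same two steps to simulated, positively skewed values. The data and variable names are assumptions made for illustration only.

```r
set.seed(1)
x <- rlnorm(200, meanlog = 3, sdlog = 1)  # simulated, positively skewed values
x_log <- log(x)                           # log transform reduces the skew
z <- as.numeric(scale(x_log, center = TRUE, scale = TRUE))  # z-scores as a plain vector
c(mean = mean(z), sd = sd(z))             # approximately 0 and 1
```

Note that scale() returns a one-column matrix, so wrapping it in as.numeric() stores a plain numeric vector in the data frame, which some downstream functions handle more predictably.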

Decisions

After these transformations, the data is ready for clustering. It is still unknown how many clusters are appropriate for the data, that is, into how many clusters the customers need to be segmented. To determine this number, the following code was run. It also produces a bar plot of the number of criteria that favour each candidate number of clusters. The tallest bar indicates the number of clusters to choose.

Table 8: Determination of the Number of Clusters

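library(NbClust)
set.seed(1)
# Assumption: "preprocessed" holds the three standardized RFM columns; it is not defined in the code shown earlier
preprocessed <- as.matrix(customers[, c("recency.z", "frequency.z", "monetary.z")])
nc <- NbClust(preprocessed, min.nc=2, max.nc=7, method="kmeans")
table(nc$Best.n[1,]) # number of criteria that favour each number of clusters
nc$All.index # estimates for each number of clusters on 26 different metrics of model fit
barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters", ylab="Number of Criteria",
main="Number of Clusters Chosen by Criteria")
remove(preprocessed)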
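A simpler, complementary check is the elbow method, which is a different technique from the NbClust vote used above and is shown here only as a sketch. It uses the built-in kmeans() function on simulated data standing in for the three standardized RFM columns. The data and variable names are assumptions.

```r
set.seed(1)
# Simulated stand-in for the three standardized RFM columns: three groups in 3-D
toy <- rbind(matrix(rnorm(150, mean = 0, sd = 0.5), ncol = 3),
             matrix(rnorm(150, mean = 4, sd = 0.5), ncol = 3),
             matrix(rnorm(150, mean = 8, sd = 0.5), ncol = 3))
ks <- 2:7  # same range as min.nc and max.nc above
# Total within-cluster sum of squares for each candidate number of clusters
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting wss against ks, for example with plot(ks, wss, type = "b"), shows where the curve flattens. The number of clusters at that bend is a reasonable candidate to compare with the NbClust result.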

The R code run for the clustering analysis, which plots the clusters in three dimensions, is given in the following table:

Table 9: R-Codes for k-means Clustering

colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')
library(car) # provides scatter3d()
library(rgl) # 3D graphics device used by scatter3d()
# Note: customers$cluster_5 must already hold the cluster labels as a factor; the code that creates it is not shown here
scatter3d(x = customers$frequency.log,
y = customers$monetary.log,
z = customers$recency.log,
groups = customers$cluster_5,
xlab = "Frequency (Log-transformed)",
ylab = "Monetary Value (log-transformed)",
zlab = "Recency (Log-transformed)",
surface.col = colors,
axis.scales = FALSE,
surface = TRUE, # produces the horizontal planes through the graph at each level of monetary value
fit = "smooth",
# ellipsoid = TRUE, # to graph ellipses use this argument and comment out "surface = TRUE"
grid = TRUE,
axis.col = c("black", "black", "black"))
remove(colors)
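The plotting call above expects a factor column of cluster labels, such as cluster_5, but the code that fits the model and creates that column is not shown. A minimal sketch of how such a column could be produced with the built-in kmeans() function is given below. The toy data, the choice of three clusters, and the names customers_toy and cluster_3 are assumptions made for illustration.

```r
set.seed(1)
# Simulated stand-in for the standardized customer-level columns (values invented)
customers_toy <- data.frame(
  recency.z   = c(rnorm(30, -1.0, 0.3), rnorm(30, 0, 0.3), rnorm(30,  1.5, 0.3)),
  frequency.z = c(rnorm(30,  1.5, 0.3), rnorm(30, 0, 0.3), rnorm(30, -1.0, 0.3)),
  monetary.z  = c(rnorm(30,  1.5, 0.3), rnorm(30, 0, 0.3), rnorm(30, -1.0, 0.3))
)
rfm_cols <- c("recency.z", "frequency.z", "monetary.z")

# Fit k-means with three clusters; nstart = 25 tries 25 random starts and keeps the best
fit <- kmeans(customers_toy[, rfm_cols], centers = 3, nstart = 25)

# Store the labels as a factor so that they can be passed to a "groups" argument
customers_toy$cluster_3 <- factor(fit$cluster)
table(customers_toy$cluster_3)

# Average standardized R, F, and M per cluster, to help describe each segment
aggregate(cbind(recency.z, frequency.z, monetary.z) ~ cluster_3,
          data = customers_toy, FUN = mean)
```

Using several random starts is a common way to reduce the dependence of the result on the initial centers.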

The previous section discussed the whole model along with the code. After restricting the data to one year and one country, there are 19,140 unique invoices and 3,891 unique customers. Evaluating the recency, frequency, and monetary value of the purchases adds three extra columns to the customer-level dataset. The clustering analysis is based on these three newly added columns only.

Applying the Pareto 80-20 rule shows that 80 percent of the purchases come from the top 29 percent of the customers. Strictly, this departs from the Pareto rule, but 29 percent is not far from 20 percent, so the rule is treated as approximately satisfied. The result also illustrates that a relatively small segment of the customers produces most of the value.

Running the code for determining the number of clusters produces the following bar plot. It shows that the highest number of criteria favour three clusters. Thus, the data can be segmented into three different clusters.

The clusters into which the data has been segmented are illustrated in the following diagram:

This model establishes that the top 29 percent of the customers provide most of the sales of the retail store. The business can therefore offer benefits such as discounts to these customers, who are so important to it, which should increase the sales of the store. The main purpose of the analysis is to help the store develop its business according to these results.

K-means clustering is a very useful and widely used technique for segmentation in the data analytics field. However, it brings certain limitations to this study, which are listed below:

It is difficult to predict the value of “k”.
Different initial partitions can lead to different final clusters.
It does not work well when the clusters in the original data differ in size or density.
It does not work well for non-globular clusters.

Conclusion

The analysis performs a customer segmentation for marketing purposes. Customers are segmented into those of high value and those of low value to the retail store. The analysis shows that the high-value customers provide the store with 80 percent of its business. Clustering analysis was performed to identify this segmentation.

The company can introduce discount offers for the customers who provide it with the most sales. This would also encourage customers who buy less to purchase more so that they can avail of the discount. In turn, the sales and profitability of the company would increase, as would the popularity of the store in the market.

References

Archive.ics.uci.edu. (2018). UCI Machine Learning Repository: Online Retail Data Set. [online] Available at: https://archive.ics.uci.edu/ml/datasets/Online+Retail [Accessed 1 Jun. 2018].

Mb3is.megx.net. (2018). Non-hierarchical cluster analysis – GUSTA ME. [online] Available at: https://mb3is.megx.net/gustame/dissimilarity-based-methods/cluster-analysis/non-hierarchical-cluster-analysis [Accessed 1 Jun. 2018].

Moon, T. (2018). Customer Segmentation : From Demographics & Psychographics to Predictive Modeling. [online] Brillio.com. Available at: https://www.brillio.com/insights/blog-posts/customer-segmentation-from-demographics-psychographics-to-predictive-modeling [Accessed 1 Jun. 2018].

Optimove. (2018). Customer Segmentation | Optimove. [online] Available at: https://www.optimove.com/learning-center/customer-segmentation [Accessed 1 Jun. 2018].

Professional, M. (2018). Market Segment Analysis. [online] Expertwebprofessionals.com. Available at: https://expertwebprofessionals.com/internet-marketing/market-segment-analysis.html [Accessed 1 Jun. 2018].
