This report provides a detailed analysis of market segmentation based on a secondary dataset. The dataset is a transnational record of the transactions of a non-store online retailer based in the United Kingdom (UK). It contains all transactions from 01/12/2010 to 09/12/2011, with information on Invoice Number, Stock Code, Product Description, Quantity, Invoice Date, Unit Price, Customer ID, and Country. Market segmentation is a tool comprising a set of strategies that help to define the market and allocate resources, supported by a statistical analysis technique, namely cluster analysis (Professional, 2018). More specifically, the non-hierarchical technique of K-means clustering is used here to perform a market segmentation, specifically a customer segmentation, using the statistical software “R”. Customer segmentation is a part of market segmentation in which the customers of a company are divided into groups according to similarities among them, so that members are homogeneous within a group and heterogeneous across groups. This segmentation helps marketers to understand customer preferences and increases customer value to the business (Optimove, 2018). Ultimately, it helps marketers to maximise the revenue from each customer (Optimove, 2018). The paper presents a customer segmentation on the basis of the psychographic parameter and uses the “Monetary value” of the products purchased by the customers to segment them (Moon, 2018). This Monetary value is the product of the “Unit Price” and “Quantity” variables. The segmentation helps to understand the preferences, purchasing power, and lifestyle of the customers.
The chosen topic of this paper is customer segmentation analysis. To meet this requirement, a secondary transaction dataset was chosen from the internet. The dataset is multivariate and sequential in nature. The “Monetary value” of the products purchased by each customer is calculated as the product of the “Quantity” and “Unit Price” variables. The customers are then clustered by “Monetary value”, so that each cluster holds the customers who have purchased products of similar monetary value. In other words, the aim is to segment the customers into clusters that represent certain characteristics. In principle, the researcher could distinguish an unlimited number of characteristics; however, the optimal number depends on the objective of the business.
The rationale behind this analysis is to help marketers identify customers in the most effective manner (Optimove, 2018). It also helps the marketer to study distinct groups of customers with greater accuracy on the basis of psychographic factors. The customer segmentation is performed with an unsupervised clustering technique. The k-means clustering technique is preferred in this study because the total number of observations considered is more than 100.
The study is based on a real dataset from a UK-based non-store online retailer, most of whose customers are wholesalers. The data covers the transactions that occurred from 01/12/2010 to 09/12/2011.
The dataset contains behavioral data based on the product transactions of a non-store online retailer (Archive.ics.uci.edu, 2018). There are eight variables in the dataset: Invoice Number, Stock Code, Product Description, Quantity, Invoice Date, Unit Price, Customer ID, and Country.
The variables used for the calculation and coding are Customer ID, Quantity and Unit Price.
After the dataset is collected, it is prepared for the necessary calculations. The values of “Quantity” and “Unit Price” are multiplied to obtain the “Monetary Value”.
K-means clustering is a type of non-hierarchical clustering that seeks a grouping of objects which maximizes or minimizes a criterion of interest. The standard steps of the technique are followed to perform the K-means clustering (Mb3is.megx.net, 2018).
K-means clustering is most effective when the number of observations is large, as the technique computes quickly for small values of “k”.
The objective of this method is to specify the number of clusters for the k-means clustering technique. First, a range of candidate cluster numbers is checked, and then the optimal number of clusters is selected (Mb3is.megx.net, 2018).
The entire calculation, evaluation, and graphical representation were carried out in the statistical software “R” (version 3.5.0).
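The usual k-means procedure (choose initial centroids, assign each observation to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing) can be sketched in base R as follows. This is only an illustrative sketch and not the code used in the study; the function name `simple_kmeans` and the toy data are hypothetical.

```r
# Minimal sketch of the standard k-means iteration (illustrative only).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Step 2: squared distance of every observation to every centroid
    d <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    # Assign each observation to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: recompute each centroid as the mean of its members
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Toy data (invented): two well-separated groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

In practice the built-in `kmeans()` function from the `stats` package would normally be used instead, since it supports several random starts.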
The following table (Table 1) shows the list of libraries used to perform this analysis on this open-source software.
Table 1
library(factoextra)
library(fpc)
library(NbClust)
library(clValid)
library(magrittr)
library(clustertend)
library(cluster)
library(plyr)
library(XLConnect)
This study is based on market segmentation. Customer segmentation is a part of market segmentation, and so the customer segmentation of a retail store has been considered. The important part of the analysis is to identify the optimal number of clusters. To determine this number, a set of metrics is used in the analysis. The optimal number of clusters is three. Before the number of clusters can be identified, the collected data must be cleaned.
There is no demographic or psychographic information about the customers in the dataset. Had these factors been present, the segmentation would have been done accordingly. In their absence, the segmentation is done so as to separate the customers of higher and lower value. This will be helpful for the retail store for the purpose of marketing.
Thus, given the dataset, it was decided that the study would use the recency of the last purchase, the frequency of purchase and the monetary value of the purchase. These three metrics are known as the RFM metrics and are among the most important metrics for customer segmentation in marketing. After these metrics are evaluated from the dataset, the segmentation is done with the k-means clustering technique, which handles large datasets well and produces solutions quickly.
The steps followed to model the k-means clustering are stated in this section. The dataset is very large and contains information about a UK based retail store. The store has outlets in various countries spread all around the globe. The dataset also contains some missing values, which are eliminated at the beginning of the study. The R code used to remove the missing data from the dataset is given as follows:
Table 2: R-Codes to remove Missing Data
# Number of unique customer IDs in the data
length(unique(data$CustomerID))
# Number of rows with a missing customer ID
sum(is.na(data$CustomerID))
# Keep only the rows that have a customer ID
data <- subset(data, !is.na(data$CustomerID))
The dataset is quite large, so for simplicity of analysis the data is restricted to a smaller, recent period. Only the transactions dated on or after 9 December 2010 are considered for this study. Further, customer preferences have been observed to vary by location, so the data is further restricted to one geographic unit. Since the retail store is mainly UK based, the location is restricted to the United Kingdom. The R code run to extract the data is presented in the following table:
Table 3: R-Codes for Data Extraction
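# Date range before filtering
range(data$InvoiceDate)
# Keep transactions dated on or after 9 December 2010
data <- subset(data, InvoiceDate >= "2010-12-09")
range(data$InvoiceDate)
# Number of transactions per country
table(data$Country)
# Keep United Kingdom transactions only
data <- subset(data, Country == "United Kingdom")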
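The date comparison in Table 3 behaves as intended only if InvoiceDate is stored as a date or as an ISO-formatted (year-month-day) string. As a hedged sketch, assuming the raw column is text in day/month/year hour:minute form, it could be parsed first as shown below. The format string is an assumption about the raw file, and the toy data frame is invented for illustration.

```r
# Toy transactions (invented values); column names follow the dataset
toy <- data.frame(
  InvoiceDate = c("01/12/2010 08:26", "09/12/2010 10:15", "09/12/2011 12:50"),
  Country     = c("United Kingdom", "France", "United Kingdom"),
  stringsAsFactors = FALSE
)

# Parse the text into Date values (the format string is an assumption)
toy$InvoiceDate <- as.Date(toy$InvoiceDate, format = "%d/%m/%Y %H:%M")

# Apply the same two filters as Table 3, now on proper dates
recent_uk <- subset(toy, InvoiceDate >= as.Date("2010-12-09") &
                         Country == "United Kingdom")
print(recent_uk)
```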
As already discussed, the recency and frequency variables are not present in the dataset and need to be evaluated from the data that is present. In a retail store, people purchase items and also return items that do not satisfy them. To calculate the recency and frequency of purchases, it is therefore important to distinguish the purchase invoices from the return invoices. The R code for this distinction is given in the following table:
Table 4: R-Codes for distinguishing between the purchase invoices and the return invoices
# Flag return invoices: invoice numbers containing "C"
data$item.return <- grepl("C", data$InvoiceNo, fixed=TRUE)
# 1 for a purchase invoice, 0 for a return invoice
data$purchase.invoice <- ifelse(data$item.return, 0, 1)
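To illustrate the flagging step on a small scale, the sketch below applies it to three invented invoice numbers. It uses `startsWith()` as a slightly stricter alternative that matches the letter only at the beginning of the number; this variant is an assumption for illustration and is not the code used in the study.

```r
# Invented invoice numbers; the second one stands for a return
invoice_no <- c("536365", "C536379", "536380")

# TRUE where the invoice number begins with "C"
is_return <- startsWith(invoice_no, "C")

# 1 for a purchase invoice, 0 for a return invoice
purchase_invoice <- ifelse(is_return, 0, 1)

data.frame(invoice_no, is_return, purchase_invoice)
```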
The recency variable indicates the number of days that have passed since the customer last made a purchase. The smaller the value of the recency variable, the more recently the customer has purchased a product. The frequency variable indicates the number of purchases from the store during the year, and the monetary value indicates the amount spent by the customer on products during the year. A negative monetary value indicates that the customer has returned items purchased in the previous year. Such cases have been recoded as zero. The necessary code is provided in the following table:
Table 5: Evaluation of RFM Variables
## --------------- Creation of Customer-Level Dataset --------------- ##
customers <- as.data.frame(unique(data$CustomerID))
names(customers) <- "CustomerID"

## --------------- Evaluation of Recency Variable --------------- ##
data$recency <- as.Date("2011-12-10") - as.Date(data$InvoiceDate)
# Remove returns so that only the most recent *purchase* is considered
temp <- subset(data, purchase.invoice == 1)
# Obtain number of days since most recent purchase
recency <- aggregate(recency ~ CustomerID, data=temp, FUN=min, na.rm=TRUE)
remove(temp)
# Add recency to customer data
customers <- merge(customers, recency, by="CustomerID", all=TRUE, sort=TRUE)
remove(recency)
customers$recency <- as.numeric(customers$recency)

## --------------- Evaluation of Frequency Variable --------------- ##
customer.invoices <- subset(data, select = c("CustomerID", "InvoiceNo", "purchase.invoice"))
customer.invoices <- customer.invoices[!duplicated(customer.invoices), ]
customer.invoices <- customer.invoices[order(customer.invoices$CustomerID), ]
row.names(customer.invoices) <- NULL
# Number of invoices/year (purchases only)
annual.invoices <- aggregate(purchase.invoice ~ CustomerID, data=customer.invoices, FUN=sum, na.rm=TRUE)
names(annual.invoices)[names(annual.invoices)=="purchase.invoice"] <- "frequency"
# Add number of invoices to customers data
customers <- merge(customers, annual.invoices, by="CustomerID", all=TRUE, sort=TRUE)
remove(customer.invoices, annual.invoices)
range(customers$frequency)
table(customers$frequency)
# Remove customers who have not made any purchases in the past year
customers <- subset(customers, frequency > 0)

## --------------- Evaluation of Customer's Monetary Value --------------- ##
# Total spent on each item on an invoice
data$Amount <- data$Quantity * data$UnitPrice
# Aggregated total sales to customer
annual.sales <- aggregate(Amount ~ CustomerID, data=data, FUN=sum, na.rm=TRUE)
names(annual.sales)[names(annual.sales)=="Amount"] <- "monetary"
# Add monetary value to customers dataset
customers <- merge(customers, annual.sales, by="CustomerID", all.x=TRUE, sort=TRUE)
remove(annual.sales)
# Identify customers with negative monetary value numbers, as they were
# presumably returning purchases from the preceding year
hist(customers$monetary)
# Reset negative numbers to zero
customers$monetary <- ifelse(customers$monetary < 0, 0, customers$monetary)
hist(customers$monetary)
According to the Pareto Principle, 80 percent of the results come from 20 percent of the causes. Applied to the current study, the principle suggests that the top 20 percent of the customers of the retail store account for 80 percent of the purchases. These customers are the high-value customers, who are very important to the business and whom the business would want to retain. The R code for this 80-20 segmentation is given in the following table:
Table 6: R Codes for 80-20 Segmentation
# Sort customers from highest to lowest monetary value
customers <- customers[order(-customers$monetary), ]
# Application of the Pareto Principle (80/20 Rule)
pareto.cutoff <- 0.8 * sum(customers$monetary)
customers$pareto <- ifelse(cumsum(customers$monetary) <= pareto.cutoff, "Top 20%", "Bottom 80%")
customers$pareto <- factor(customers$pareto, levels=c("Top 20%", "Bottom 80%"), ordered=TRUE)
levels(customers$pareto)
# Share of customers in each group
round(prop.table(table(customers$pareto)), 2)
remove(pareto.cutoff)
# Restore the ordering by customer ID
customers <- customers[order(customers$CustomerID), ]
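The logic of the cutoff can be checked on a small invented example. In the sketch below, ten hypothetical customers have a total monetary value of 1000, so the cutoff is 800; the share of customers whose cumulative spend stays within the cutoff is then computed. The data frame `toy` and its values are assumptions made purely for illustration.

```r
# Ten invented customers with a total monetary value of 1000
toy <- data.frame(CustomerID = 1:10,
                  monetary   = c(500, 300, 80, 40, 30, 20, 10, 10, 5, 5))

# Sort from highest to lowest spend
toy <- toy[order(-toy$monetary), ]

# 80 percent of total spend
cutoff <- 0.8 * sum(toy$monetary)

# Customers whose cumulative spend stays within the cutoff
toy$top <- cumsum(toy$monetary) <= cutoff

# Share of customers needed to reach 80 percent of total spend
share_of_customers <- mean(toy$top)
share_of_customers
```

Here the two largest customers together account for 80 percent of the spend, so the share is 2 out of 10. On real data the share need not equal 20 percent.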
The next step is to preprocess the data so that it is ready for the k-means clustering model. K-means requires continuous variables and works best when the variables are standardized and approximately normally distributed. To meet this criterion, the input variables are log-transformed to reduce skewness and then standardized. The code required for this transformation is given in the following table:
Table 7: R codes for log transformation and Standardization to normal Variables
# Log-transform positively-skewed variables
customers$recency.log <- log(customers$recency)
customers$frequency.log <- log(customers$frequency)
# Can't take log(0), so add a small value to remove zeros
customers$monetary.log <- customers$monetary + 0.1
customers$monetary.log <- log(customers$monetary.log)
# Z-scores
customers$recency.z <- scale(customers$recency.log, center=TRUE, scale=TRUE)
customers$frequency.z <- scale(customers$frequency.log, center=TRUE, scale=TRUE)
customers$monetary.z <- scale(customers$monetary.log, center=TRUE, scale=TRUE)
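The standardization performed by `scale()` is the usual z-score: each value minus the mean, divided by the sample standard deviation. The short sketch below confirms this on a few invented, positively skewed values; the vector `x` is hypothetical.

```r
# Invented, positively skewed values
x <- c(1, 10, 100, 1000)

# Log transformation reduces the skewness
x_log <- log(x)

# Z-score computed by hand
z_manual <- (x_log - mean(x_log)) / sd(x_log)

# Z-score from scale(); as.numeric() drops the matrix structure it returns
z_scale <- as.numeric(scale(x_log, center = TRUE, scale = TRUE))

# The two results agree
all.equal(z_manual, z_scale)
```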
After all these transformations, the data is ready for clustering. It is still unknown how many clusters are appropriate for the data, that is, into how many clusters the customers need to be segmented. To determine the appropriate number of clusters, the following code has been run. A histogram of the result is also produced. The tallest bar of the histogram indicates the number of clusters.
Table 8: Determination of the Number of Clusters
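library(NbClust)
set.seed(1)
# Assumption: 'preprocessed' holds the three standardized RFM columns;
# its definition is not shown in the source
preprocessed <- customers[, c("recency.z", "frequency.z", "monetary.z")]
nc <- NbClust(preprocessed, min.nc=2, max.nc=7, method="kmeans")
# Number of criteria that favour each number of clusters
table(nc$Best.n[1,])
# Estimates for each number of clusters on 26 different metrics of model fit
nc$All.index
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters",
        ylab="Number of Criteria",
        main="Number of Clusters Chosen by Criteria")
remove(preprocessed)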
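A simpler and commonly used cross-check on the number of clusters is the elbow method, which plots the total within-cluster sum of squares against the number of clusters and looks for the point where the curve flattens. The sketch below uses only base R and invented data with three groups; it is an alternative illustration, not the method used in the study.

```r
# Invented data: three groups of 20 observations on three variables
set.seed(1)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
             matrix(rnorm(60, mean = 4), ncol = 3),
             matrix(rnorm(60, mean = 8), ncol = 3))

# Total within-cluster sum of squares for k = 1, ..., 7
ks <- 1:7
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" of this curve suggests a suitable number of clusters
plot(ks, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```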
The R code run for the clustering analysis is given in the following table:
Table 9: R-Codes for k-means Clustering
colors <- c('red', 'orange', 'green3', 'deepskyblue', 'blue', 'darkorchid4', 'violet', 'pink1', 'tan3', 'black')
library(car)
library(rgl)
# Note: customers$cluster_5 (the cluster labels) is not created in the code shown in this report
scatter3d(x = customers$frequency.log,
          y = customers$monetary.log,
          z = customers$recency.log,
          groups = customers$cluster_5,
          xlab = "Frequency (Log-transformed)",
          ylab = "Monetary Value (log-transformed)",
          zlab = "Recency (Log-transformed)",
          surface.col = colors,
          axis.scales = FALSE,
          surface = TRUE,   # produces the horizontal planes through the graph at each level of monetary value
          fit = "smooth",
          # ellipsoid = TRUE,   # to graph ellipses, use this argument and comment out "surface = TRUE"
          grid = TRUE,
          axis.col = c("black", "black", "black"))
remove(colors)
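The plotting code above expects a column of cluster labels to exist already. As a minimal sketch of how such labels might be produced, the example below fits k-means with three clusters (the number selected in this study) using `kmeans()` from the `stats` package. The data frame `toy` is an invented stand-in for the standardized RFM columns, and the column name `cluster_3` is hypothetical.

```r
# Invented stand-in for the standardized RFM columns
set.seed(1)
toy <- data.frame(recency.z   = rnorm(90),
                  frequency.z = rnorm(90),
                  monetary.z  = rnorm(90))

# Fit k-means with three clusters; nstart repeats the fit from several random starts
fit <- kmeans(toy, centers = 3, nstart = 20)

# Store the labels as a factor so they can be used for grouping in plots
toy$cluster_3 <- factor(fit$cluster)

# Number of observations in each cluster
table(toy$cluster_3)

# Mean of each standardized variable per cluster, to help interpret the segments
aggregate(. ~ cluster_3, data = toy, FUN = mean)
```

Comparing the cluster means in this way is one simple means of describing each segment, for example as recent and frequent buyers versus lapsed, low-spending customers.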
In the previous section, the whole model has been discussed along with the code. After restricting the data to one year and one country, 19,140 unique invoices and 3,891 unique customers remain. Evaluating the recency, frequency and monetary value of the purchases adds three extra columns to the dataset. The clustering analysis is based on these three newly added columns only.
Applying the Pareto 80-20 rule shows that 80 percent of the purchases come from the top 29 percent of the customers. Strictly, this departs from the Pareto rule, but 29 percent is not far from 20 percent, so the rule is taken to be approximately satisfied. The result also illustrates that a relatively small segment produces most of the value.
Running the code for determining the number of clusters produces the following histogram. The histogram shows that the highest frequency is observed for three clusters. Thus, the data can be segmented into three different clusters.
The clusters into which the present data has been segmented are illustrated in the following diagram:
With the help of this model, it has been established that the top 29 percent of the customers provide most of the sales of the retail store. The business can therefore offer benefits such as discounts to these important customers, which should increase the sales of the store. The main purpose of this analysis is to give the store results on which it can base the development of its business.
K-means clustering is a very useful and widely used technique for performing segmentation in the data analytics field. However, there are certain limitations to this study that are discussed below:
Conclusion
The analysis has been performed on customer segmentation for marketing purposes. Customers are segmented into those of high value and those of low value to the retail store. The analysis shows that the high-value customers provide the store with 80 percent of its business. Clustering analysis has been performed to identify this segmentation.
The company can introduce discount offers for the customers who provide it with the largest number of sales. This will encourage customers who buy less to purchase more so that they can obtain the discount. This will in turn increase the sales of the company, and its profitability will increase. The popularity of the store in the market will also increase.
References
Archive.ics.uci.edu. (2018). UCI Machine Learning Repository: Online Retail Data Set. [online] Available at: https://archive.ics.uci.edu/ml/datasets/Online+Retail [Accessed 1 Jun. 2018].
Mb3is.megx.net. (2018). Non-hierarchical cluster analysis – GUSTA ME. [online] Available at: https://mb3is.megx.net/gustame/dissimilarity-based-methods/cluster-analysis/non-hierarchical-cluster-analysis [Accessed 1 Jun. 2018].
Moon, T. (2018). Customer Segmentation : From Demographics & Psychographics to Predictive Modeling. [online] Brillio.com. Available at: https://www.brillio.com/insights/blog-posts/customer-segmentation-from-demographics-psychographics-to-predictive-modeling [Accessed 1 Jun. 2018].
Optimove. (2018). Customer Segmentation | Optimove. [online] Available at: https://www.optimove.com/learning-center/customer-segmentation [Accessed 1 Jun. 2018].
Professional, M. (2018). Market Segment Analysis. [online] Expertwebprofessionals.com. Available at: https://expertwebprofessionals.com/internet-marketing/market-segment-analysis.html [Accessed 1 Jun. 2018].