E-commerce Customer Segmentation

Problem Definition


During the Covid-19 pandemic in 2020, total sales in e-commerce increased by 37%. Due to increasingly fierce competition between competitors, you and your Product Manager discussed how to stay afloat and compete in e-commerce industry. After that, you choose to create innovation or offers for users to keep choosing you as their online shopping media.

You are assigned to perform transactions-related analysis for new users' data. However, the problem is, our company is doing efficiency in managing promotional funds in 2021. As a Data Analyst, what kind of insights and recommendations you can provide to the company?

Data Set: Download here

  1. Change the year in InvoiceDate column to 2020–2021
  2. UnitPrice column is in USdollar
  3. Country column is unnecessary

In this analysis we’re using Online Retail Dataset consists of 7 columns with the descriptions for each column:

  1. InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
  2. StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  3. Description: Product (item) name. Nominal.
  4. Quantity: The quantities of each product (item) per transaction. Numeric.
  5. InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
  6. UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  7. CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

Define the Scope

Because company wants efficiency in promotional funds, we’re focusing in analysis which could help promotional strategy.

Business Questions

These are the questions we try to find the answers through analysis:

a. Does the orders have a time pattern? When people most likely to do online shopping?
b. How are the characteristics of our customers’ behavior?

The second question, we try to answer using RFM customer segmentation. We can’t analyze one by one customer’s behavior, but we can group customers based on their purchasing behavior. Then we can choose promotional strategies based on their group behavior.


First, we imported the libraries and the dataset, also doing basic analysis in descriptive statistics.

import pandas as pd
import matplotlib.pyplot as plt
import calendar
import datetime
df = pd.read_csv('data_ecommerce.csv', encoding = 'ISO-8859-1')

The dataset will be like this, it has 8 columns and 541909 rows.

Next, we’re checking the rows with missing values. There are 2 columns with empty rows, Description and CustomerID. There is no problem leaving rows in the Description blank. CustomerID column can’t be impute to anything, so we should drop the empty rows.

print('Missing values:', df.isnull().sum())
df.dropna(axis = 0, subset = ['CustomerID'], inplace = True)
print('Missing values after cleaning:',df.isnull().sum())

Just like definition above we should change the year in InvoiceDate to 2019–2020 and remove Country column. We also change the data type to a proper one, like InvoiceDate from string to datatime type and CustomerID from integer to string or object.

#Change 2010-2011 to 2020-2021
df['InvoiceDate']=df['InvoiceDate'].apply(lambda x: x.replace('2010','2019'))
df['InvoiceDate']=df['InvoiceDate'].apply(lambda x: x.replace('2011','2020'))
#Drop Country column and change the year in InvoiceDate
df = df.drop('Country',axis=1)
#Data event time menjadi datetime, kemudian userid, shopid, orderid merupakan string
timefmt = "%m/%d/%Y %H:%M"
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format = timefmt, errors='coerce')
#Customer ID is an object
df['CustomerID']=df['CustomerID'].apply(lambda x: str(int(x)))

We also check the duplicated rows and drop it.

print(“Duplicated Values: {}”.format(df.duplicated().sum()))

When we see the descriptive statistics for numerical data, quasuch as Quantity and Unit Price, there is one thing that doesn’t make sense. The minimum Quantity is a negative number, so there are might be more than one orders have quantity below zero. We can ask another department to make sure the meaning of negative quantity, but for this case we drop them.

# drop quantity <= 0
df = df[df['Quantity']>=0]

Now the dataset seems more make sense. We’re also adding one more column called TotalPrice.

# Create TotalPrice column
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

a) Visualization of Order Patterns

print('First date data recorder: {}'.format(min(df.InvoiceDate)))
print('Last date data recorder: {}'.format(max(df.InvoiceDate)))
df['month_year'] = df['InvoiceDate'].dt.to_period('M')#Bar Plot for Monthly Orders
plt.subplot(1, 2, 1)
plt.title('Monthly Orders')
#Bar Plot for Monthly Revenue
plt.subplot(1, 2, 2)
plt.title('Monthly Revenue')

Because the data was recorded on December 1st, 2019 until December 9th, 2020, monthly data for December 2020 couldn’t represent monthly behavior of orders and revenue. So the data for this month will be ignored. We could see there are significant orders and revenue increase for 3 consecutive months of 2020, on September, October, and November. If we have more data on December 2020, the orders and revenue more likely increase as well. Because in the first 9 days the orders already around one third part of orders in November 2020. Also the revenue already half of November 2020.

b) RFM Segmentation

RFM stands for Recency-Frequency-Monetary Value.

max_InvoiceDate = max(df['InvoiceDate'])
NOW = datetime.datetime(2020,12,10)
RFMtable = df.groupby('CustomerID').agg({
'InvoiceDate': lambda x: (NOW-x.max()).days,
'InvoiceNo': 'count',
'TotalPrice': 'sum'
#Histogram Recency
plt.subplot(3, 1, 1)
plt.title('Recency', fontsize=15)
plt.xlabel('Recency(days)', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)
#Histogram Frequency
plt.subplot(3, 1, 2)
plt.title('Frequency', fontsize=15)
plt.xlabel('Frequency(orders)', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)
#Histogram Revenue
plt.subplot(3, 1, 3)
plt.title('Revenue', fontsize=15)
plt.xlabel('Revenue(USD)', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)

Recency, Frequency, and Revenue histograms are right skewed. There are more people who recently ordered, but the people who the last order would be >100 days ago also scattered quite a lot. Most of the customers rarely placed orders or the frequency is small, but several customers ordered a lot of times, until thousands of times. So there are several loyal customers. Revenue comes from each customers mostly small, may be because each customer tends to order things with small quantity or cheap unit price, but there are also customers who gave a lot revenue for the store.

RFM_revenuehigh = RFMtable[RFMtable.Revenue>50000]
#Histogram Frequency
plt.subplot(2, 1, 1)
plt.title('Frequency from Customers with >50k USD Revenue', fontsize=15)
plt.xlabel('Frequency(orders)', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)
#Histogram Revenue
plt.subplot(2, 1, 2)
plt.title('Customers with >50k USD Revenue', fontsize=15)
plt.xlabel('Revenue(USD)', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)

So customers who gave the store large revenues (>50k USD) are only a few, there are customers who only do several orders but in big prices and there are customers who often place orders.

RFM Segmentation using K-Means Clustering

Clustering is a class of unsupervised machine learning models. Clustering aims to group unlabeled objects, objects in the same group means they have similar properties, meanwhile if they’re differentiate with other groups, that means they have different properties. One of algorithm to do clustering or cluster analysis is using K-Means.

The cluster center of K-Means algorithm is the arithmetic mean of all points in the cluster. For reference, I modified several code from here.

from sklearn.cluster import KMeans
def elbow_method(dataframe):
newdf = RFMtable[['dataframe']]
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=0).fit(newdf)
sse[k] = kmeans.inertia_
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
return plt.show()
# K-Means on Recency feature
kmeans = KMeans(n_clusters=4, random_state=0)
RFMtable['RecencyClust'] = kmeans.predict(RFMtable[['Recency']])
# K-Means on Frequency feature
RFMtable['FreqClust'] = kmeans.predict(RFMtable[['Frequency']])
# K-Means on Revenue feature
RFMtable['RevenueClust'] = kmeans.predict(RFMtable[['Revenue']])

In the cluster RFM table, each Recency, Frequency, and Revenue Cluster columns are not sequential. So it will be sorted from the best criterion. For the Recency feature, the best customers are the ones who have recently made an order to those who have not placed an order for a long time. Then, for the Frequency feature, the best customers are the ones who place orders frequently to those who place orders the least. The last feature is Revenue or Monetrary Value, the best customers are those placed orders with large revenue to small revenue for the company.

To make it easier to create an ordered_cluster function with the df table to be processed, clustername the name of the feature cluster column concerned, the targetname of the respective feature value column, and the ascending order type, True for ascending order and False for descending order.

def ordered_cluster(df,clustername,targetname,ascending):
df_new = df.groupby(clustername)[targetname].mean().sort_values(ascending=ascending).reset_index()
df_new['index'] = df_new.index+1 #So the cluster would start at cluster 1-4
df_fix = pd.merge(df,df_new[[clustername,'index']], on=clustername).drop(clustername,axis=1)
return df_fix
RFMtable = ordered_cluster(RFMtable,'RecencyClust','Recency',True)
RFMtable = ordered_cluster(RFMtable,'FreqClust','Frequency',False)
RFMtable = ordered_cluster(RFMtable,'RevenueClust','Revenue',False)
#Create RFMindex column

In the cluster of each feature, number 1 shows the best group and number 4 shows the worst group of the feature. These three features produce a 4 × 4 × 4 combination of clusters or 64 clusters. From this cluster, it is grouped again so that it is easier to take action.

  1. Best Customers (RFMindex 112, 121, 122) : Loyal and high spender customers.
  2. High-Spending New Customer (RFMindex 131, 132, 142): Recently active customers with low orders frequency (can be New Customers) but obtain big revenues.
  3. Low-Spending Loyal Customer (RFMindex 113, 123, 124): Loyal customers who recently active but low-spenders.
  4. Low Spending New Customer (RFMindex 133, 134, 143, 144): They’re recently active customers who hasn’t order much and obtain low revenue.
  5. About to Sleep Customer (RFMindex: 234, 244): Low spenders and rarely placed orders, hasn’t been active recently.
  6. Churned Customer (RFMindex 334, 343, 344, 434, 443, 444): Customers who hasn’t been active for a long time, low orders frequency, and low spenders.

In this article and several others, they used RFMscore which is the sum or average from the three features cluster. RFMscore for customers with RFMindex 114 and 411 could be the same, meanwhile these two customer have different behavior. The 114 one is loyal customers with low-spending and 411 customer is at risk customer who was ever loyal and high spenders but hasn’t active for a long time.

CustomerSegment = RFMtable[['CustomerID','RFMindex']]
def segmentation(index):
if index in ['112', '121', '122']:
return 'Best Customer'
elif index in ['131', '132', '142']:
return 'High-Spending New Customer'
elif index in ['113','123', '124']:
return 'Low-Spending Loyal Customer'
elif index in ['133', '134', '143', '144']:
return 'Low-Spending New Customer'
elif index in ['234', '244']:
return 'About to Sleep Customer'
elif index in ['334', '343', '344', '434', '443', '444']:
return 'Churned Customer'
return 'Not Specified'
CustomerSegment['Segmentation']=CustomerSegment['RFMindex'].apply(lambda x: segmentation(x))


These are recommendations to manage promotional funding for these customer segments.

  1. Best Customers: Keep the customers happy and feeling valued with rewards such as personalized messages and preferences.
  2. High-Spending New Customer: Because these customers spend a lot in their first encounters so it’s important to keep engage with them.
  3. Low-Spending Loyal Customer: Give incentive to increase spending.
  4. Low Spending New Customer: Keep them shopping, give discount or free shipping vouchers.
  5. About to Sleep Customer: Send a personalized email to reconnect and remind their account.
  6. Churned Customer: Communicate with their basis specific preferences.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store