K-Means Algorithm in Python

Introduction

K-means is one of the simplest unsupervised learning algorithms for the clustering problem: it partitions a given dataset into a fixed number of clusters (k), chosen a priori.

Theoretical aspect

The main idea is to define k centroids, one for each cluster. The centroids should be placed carefully, because different starting locations lead to different results; a good choice is to place them as far away from each other as possible. The next step is to take each point in the dataset and associate it with the nearest centroid. When no point is left unassigned, the first pass is complete and an initial grouping is obtained. At this point, k new centroids are recalculated as the centers of the clusters resulting from the previous step. The same data points are then reassigned to the nearest new centroid, which creates a loop. As the loop repeats, the centroids change their location step by step until they no longer move.

For a given dataset X containing n multidimensional data points and a number of clusters k, the Euclidean distance is used as the similarity measure, and the clustering aims to minimize the within-cluster sum of squared distances:

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - u_j \rVert^2$$

where k is the number of cluster centers, n is the number of data points, u_j is the jth center, C_j is the set of points assigned to that center (each of the n points belongs to exactly one C_j), and x_i is the ith point in the dataset.
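The objective can be evaluated directly for a given assignment. The following is a minimal sketch in NumPy; the function name `within_cluster_sse` and the toy data are illustrative assumptions, not part of the original example.

```python
import numpy as np

def within_cluster_sse(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = X - centers[labels]  # shape (n, d): each point minus its own center
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: four 2-D points split into two clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

sse = within_cluster_sse(X, labels, centers)
print(sse)  # 4.0 (every point lies at distance 1 from its center)
```

This is the same quantity that scikit-learn exposes as `inertia_` after fitting a `KMeans` model.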

The algorithm consists of the following steps:
1. Place k points in the space represented by the objects that are being clustered. These points are considered as initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids stop moving.
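The four steps can be sketched in a few lines of NumPy. This is an illustrative version only: it assumes random initialization from the data points and keeps the old centroid when a cluster becomes empty, and the name `kmeans_sketch` is hypothetical. The RFM example below uses scikit-learn's `KMeans` instead.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids (assumption)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

On data this cleanly separated, the loop converges within a few iterations; on real data the result depends on the initial centroids.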

Python Example

RFM analysis

Recency, frequency and monetary (RFM) analysis is a powerful and recognized technique in database marketing. It is widely used to rank the customers based on their prior purchasing history.

1. Recency: When was the last time the customer made a purchase?
2. Frequency: How many times did the customer purchase?
3. Monetary: How much money did the customer spend?
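Before turning to the real dataset, the three measures can be illustrated on a handful of made-up transactions. The table below is hypothetical (the column names and values are assumptions chosen for the sketch), and recency is measured relative to the latest date in the toy data.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the online retail dataset)
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "OrderID": [101, 102, 103, 104, 105, 106],
    "Date": pd.to_datetime(["2020-01-05", "2020-03-01", "2020-02-10",
                            "2019-12-24", "2020-01-15", "2020-04-10"]),
    "Revenue": [50, 30, 200, 20, 25, 40],
})

# Reference date: the latest purchase in the data
snapshot = orders["Date"].max()

rfm_toy = orders.groupby("CustomerID").agg(
    Recency=("Date", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("OrderID", "count"),                         # number of orders
    Monetary=("Revenue", "sum"),                            # total spend
)
print(rfm_toy)
```

Customer 3 bought most recently (recency 0) and most often, while customer 2 spent the most in a single order.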

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

from sklearn import metrics

#loading dataset (we use online retail dataset)
df = pd.read_excel(r"filepath\online_retail.xlsx")
df

	Column1	CustomerID	OrderID	Date	Revenue	date_format	YEAR	month
0	111382	13930366	43771330	20191023	5834	2019-10-23	2019	10
1	111385	11545838	43723657	20191023	5834	2019-10-23	2019	10

Recency

# last date available in our dataset
df["date_format"].max() # Out: Timestamp('2020-04-19 00:00:00')

# setting max time as now to calculate time differences
now = dt.date(2020,4,19)

# group by customer by last date they purchased

df_recency = df.groupby(["CustomerID"] , as_index = False)["date_format"].max()
df_recency.columns = ["CustomerID" , "LastPurchaseDate"]
df_recency["LastPurchaseDate"] = pd.DatetimeIndex(df_recency.LastPurchaseDate).date
df_recency.head()

	CustomerID	LastPurchaseDate
0	465132	2019-11-30
1	465164	2020-02-28
2	465198	2020-04-13
3	465204	2019-11-03
4	465211	2020-01-22

# calculate the number of days since each customer's last purchase
df_recency['Recency'] = df_recency.LastPurchaseDate.apply(lambda x : (now - x).days)

# dropping LastPurchase Date
df_recency.drop(columns=['LastPurchaseDate'],inplace=True)

# checking recency
df_recency.head()

	CustomerID	Recency
0	465132	141
1	465164	51
2	465198	6
3	465204	168
4	465211	88

Frequency

frequency_df = df.groupby("CustomerID",as_index = False)["OrderID"].count()
frequency_df = frequency_df.sort_values(by = "OrderID" , ascending=False)
frequency_df = frequency_df.rename(columns={"OrderID" :"Frequency"})
frequency_df.head()

	CustomerID	Frequency
1525	502561	231
66329	5556237	180
136996	11826511	159
137116	11836775	153
164161	14188909	143

Monetary

# check summed up spends of customers

monetary_df = df.groupby('CustomerID',as_index=False)["Revenue"].sum()
monetary_df.columns = ['CustomerID','Monetary']
monetary_df.head()

	CustomerID	Monetary
0	465132	7302
1	465164	578
2	465198	5688
3	465204	1464
4	465211	4760

# merging recency and frequency
rf = df_recency.merge(frequency_df, on='CustomerID')

# combining with monetary values
rfm = rf.merge(monetary_df, on='CustomerID')

rfm.set_index('CustomerID', inplace=True)

# saving to file (the index is kept so that CustomerID is written as well)
rfm.to_csv('rfm.csv')

# checking the dataframe
rfm.head()

CustomerID	Recency	Frequency	Monetary
465132	141	1	7302
465164	51	1	578
465198	6	1	5688
465204	168	1	1464
465211	88	1	4760

Segmentation and Evaluation

Several approaches have been proposed in the literature to determine the number of clusters for the k-means clustering algorithm, for instance: i) rule of thumb; ii) Elbow method; iii) information criterion approach; iv) information theoretic approach; v) Silhouette method; and vi) cross-validation.

In this case we will use the Elbow method to determine the number of clusters. The basic idea is to fit the model for a series of K values and, for each one, compute the sum of squared distances between the sample points in each cluster and the centroid of that cluster. This sum of squared errors (SSE), also called the within-cluster sum of squares (WCSS), is used as the performance indicator. Plotting SSE against K, we look for the point where the decrease flattens out (the "elbow").

# creating a copy of the dataframe
rfm_segmentation = rfm.copy()

# find the proper number of clusters
fig,ax = plt.subplots(figsize=(10,8))
wcss = []
for i in range(1,11):
  kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
  kmeans.fit(rfm_segmentation)
  wcss.append(kmeans.inertia_)
plt.plot(range(1,11), wcss, 'o')
plt.plot(range(1 , 11) , wcss , '-' , alpha = 0.5)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.savefig('Elbow_Method.png')
plt.show()

The graph shows that the curve changes from a steep to a moderate slope at around 3 clusters; therefore, we choose 3 as the number of clusters for the k-means segmentation. This value is passed to the model as a parameter.
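Reading an elbow off a plot is somewhat subjective, so the Silhouette method listed earlier can serve as a cross-check. The sketch below runs on synthetic data from `make_blobs` (an assumption made so that the snippet is self-contained), not on the RFM table; a higher average silhouette score indicates better-separated clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On a table as large as the RFM data, `silhouette_score` is expensive because it works with pairwise distances; its `sample_size` parameter can be used to score a random subset instead.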

# instantiating the model

kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(rfm_segmentation)

# creating a column for the cluster

rfm_segmentation['Cluster'] = kmeans.labels_

# checking the column
rfm_segmentation[rfm_segmentation.Cluster == 0].head()

CustomerID	Recency	Frequency	Monetary	Cluster
465132	141	1	7302	0
465164	51	1	578	0
465198	6	1	5688	0
465204	168	1	1464	0
465211	88	1	4760	0

As we applied clustering, we are now able to analyze Recency, Frequency and Monetary values for each group.

Recency rate analysis

fig, ax = plt.subplots(figsize=(8,8))
plt.title('Recency for each Cluster')
sns.boxplot(x=rfm_segmentation.Cluster, y=rfm_segmentation.Recency)
plt.savefig("recency_clusters.png")

Cluster 0 has a high recency value, meaning that its customers have gone the longest since their last purchase.
Clusters 1 and 2 have low recency values, which is good. They can be our Gold and Silver Customers.

Frequency rate analysis

fig, ax = plt.subplots(figsize=(8,8))
plt.title('Frequency for each Cluster')
sns.boxplot(x=rfm_segmentation.Cluster, y=rfm_segmentation.Frequency)
plt.savefig('frequency_cluster.png')

Cluster 0 has a low frequency value, which means that customers in this cluster do not purchase often.
Clusters 1 and 2 have high frequency values, which strengthens their case for Gold and Silver.

Monetary rate analysis

fig, ax = plt.subplots(figsize=(8,8))
plt.title('Monetary for each Cluster')
sns.boxplot(x=rfm_segmentation.Cluster, y=rfm_segmentation.Monetary)
plt.savefig('monetary_cluster.png')

Cluster 0 has a low monetary value, so its members can be referred to as Bronze Customers.
Cluster 2 has a medium monetary value, which makes its members Silver Customers.
Cluster 1 has the highest monetary value, which suggests that its members are Gold Customers.

Result

Based on these analyses, we can conclude the following:

Gold Customers are in Cluster 1.
Silver Customers are in Cluster 2.
Bronze Customers are in Cluster 0.
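The same tier assignment can also be read off a per-cluster summary table rather than from boxplots alone. The sketch below uses a small hypothetical frame with made-up values and a made-up ranking rule (mean monetary value); in the analysis above, the frame would be `rfm_segmentation`.

```python
import pandas as pd

# Hypothetical RFM table with cluster labels (values are made up)
toy = pd.DataFrame({
    "Recency":   [150, 170, 20, 10, 5, 3],
    "Frequency": [1, 1, 12, 15, 60, 80],
    "Monetary":  [500, 700, 9000, 11000, 90000, 110000],
    "Cluster":   [0, 0, 2, 2, 1, 1],
})

# Mean of each measure per cluster
profile = toy.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()

# Rank clusters by mean monetary value and attach tier names (assumed rule)
ranked = profile.sort_values("Monetary", ascending=False)
tiers = {int(c): t for c, t in zip(ranked.index, ["Gold", "Silver", "Bronze"])}

print(profile)
print(tiers)  # {1: 'Gold', 2: 'Silver', 0: 'Bronze'}
```

Medians may be a more robust summary than means when a few very large customers dominate a cluster.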

Visualization

# getting the values
X = rfm_segmentation.iloc[:, [0,1,2]].values

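# 2d plot of Recency (x) against Frequency (y), colored by cluster
fig, ax = plt.subplots(figsize=(10,10))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=100, cmap='OrRd')

# overlaying the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.xlabel('Recency')
plt.ylabel('Frequency')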

# checking number of clients per cluster
rfm_segmentation.Cluster.value_counts()

0	179083
2	581
1	16

Strengths and Drawbacks of the k-means Algorithm

Strengths:
1. Simple.
2. Fast for low-dimensional data.
3. It can find pure sub-clusters if a large number of clusters is specified.
4. With a large number of variables, k-means may be computationally faster than hierarchical clustering.
5. K-means is expected to produce tighter clusters than hierarchical clustering.

Drawbacks:
1. Sensitive to the selection of the initial cluster centers: different initial values can lead to different results.
2. There is no general rule for choosing the value of k.
3. The algorithm is easily affected by outliers.
4. It may suffer from the dead unit problem, in which a centroid ends up with no points assigned to it.
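The sensitivity to initialization can be observed by running k-means several times with a single random start each and comparing the resulting SSE. The sketch below uses synthetic data from `make_blobs` (an assumption made for illustration); the exact numbers depend on the data and on the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with six overlapping groups
X_demo, _ = make_blobs(n_samples=500, centers=6, cluster_std=1.5,
                       random_state=42)

# One purely random initialization per run: the results may differ between seeds
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=6, init="random", n_init=1,
                random_state=seed).fit(X_demo)
    inertias.append(km.inertia_)

# k-means++ seeding with several restarts keeps the best of 10 runs
best = KMeans(n_clusters=6, init="k-means++", n_init=10,
              random_state=0).fit(X_demo)

print(inertias)
print(best.inertia_)
```

This is why the example above sets `init='k-means++'` and `n_init=10`: scikit-learn keeps the run with the lowest SSE, which reduces, but does not eliminate, the dependence on the starting centroids.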

References

1. https://www.researchgate.net/publication/313554124_Review_on_Determining_of_Cluster_in_K-means_Clustering

2. https://www.mdpi.com/2571-8800/2/2/16

The author of this entry is Faezeh Moradi. Edited by Evgeniya Zakharova.