K-Means Algorithm in Python

From Sustainability Methods

THIS ARTICLE IS STILL IN EDITING MODE

Introduction

K-means is one of the simplest unsupervised learning algorithms that solve the clustering problem, in other words, classify a given dataset on a certain number of clusters (assume k clusters) fixed a priori.

Theoretical aspect

The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way because different location causes different result. Therefore, the better choice is to place them as much as possible far away from each other. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early group is done. At this point, it is needed to recalculate k new centroids as centers of the clusters resulting from the previous step. After that a new binding has to be done between the same data points and the nearest new centroid. A loop has been generated. As a result of this loop the centroids change their location step by step until no more changes are done, so centroids do not move any more.

For a given dataset X containing n multidimensional data points and the number of categories k to be divided, the Euclidean distance is selected as the similarity index and the clustering aims to minimize the sum of the squares of the various types:

K means form.png

where k - cluster centers, n - number of data points, uk - the kth center, and xi the ith point in the data set.

The algorithm consists of the following steps:
1. Place k points in the space represented by the objects that are being clustered. These points are considered as initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat Step 2 and 3 until the centroids stop moving.

Python Example

RFM analysis

Recency, frequency and monetary (RFM) analysis is a powerful and recognized technique in database marketing. It is widely used to rank the customers based on their prior purchasing history.

1. Recency: When was the last time the customer made a purchase?
2. Frequency: How many times did the customer purchase?
3. Monetary: How much money did the customer spend?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

from sklearn import metrics

#loading dataset (we use online retail dataset)
df = pd.read_excel(r"filepath\online_retail.xlsx")
df
Column1 CustomerID OrderID Date Revenue date_format YEAR month
0 111382 13930366 43771330 20191023 5834 2019-10-23 2019 10
1 111385 11545838 43723657 20191023 5834 2019-10-23 2019 10

Recency

# last date available in our dataset
df["date_format"].max() # Out: Timestamp('2020-04-19 00:00:00')

# setting max time as now to calculate time differences
now = dt.date(2020,4,19)
# group by customer by last date they purchased

df_recency = df.groupby(["CustomerID"] , as_index = False)["date_format"].max()
df_recency.columns = ["CustomerID" , "LastPurchaseDate"]
df_recency["LastPurchaseDate"] = pd.DatetimeIndex(df_recency.LastPurchaseDate).date
df_recency.head()
CustomerID LastPurchaseDate
0 465132 2019-11-30
1 465164 2020-02-28
2 465198 2020-04-13
3 465204 2019-11-03
4 465211 2020-01-22
# calculate how often the customers are buying in the last few days
df_recency['Recency'] = df_recency.LastPurchaseDate.apply(lambda x : (now - x).days)

# dropping LastPurchase Date
df_recency.drop(columns=['LastPurchaseDate'],inplace=True)

# checking recencyFrequency
df_recency.head()
CustomerID Recency
0 465132 141
1 465164 51
2 465198 6
3 465204 168
4 465211 88

Frequency

frequency_df = df.groupby("CustomerID",as_index = False)["OrderID"].count()
frequency_df = frequency_df.sort_values(by = "OrderID" , ascending=False)
frequency_df = frequency_df.rename(columns={"OrderID" :"Frequency"})
frequency_df.head()
CustomerID Frequency
1525 502561 231
66329 5556237 180
136996 11826511 159
137116 11836775 153
164161 14188909 143

Monetary

# check summed up spends of customers

monetary_df = df.groupby('CustomerID',as_index=False)["Revenue"].sum()
monetary_df.columns = ['CustomerID','Monetary']
monetary_df.head()
CustomerID Monetary
0 465132 7302
1 465164 578
2 465198 5688
3 465204 1464
4 465211 4760
# merging these three features
rf = df_recency.merge(frequency_df,left_on='CustomerID',right_on='CustomerID')

# combining with monetary values
rfm = rf.merge(monetary_df,left_on='CustomerID',right_on='CustomerID')

rfm.set_index('CustomerID',inplace=True)

# saving to file
rfm.to_csv('rfm.csv', index=False)
#rf_new = rfm.sort_values(by = "Frequency" , ascending = False)

# checking the dataframe
rfm.head()
CustomerID Recency Frequency Monetary
465132 141 1 7302
465164 51 1 578
465198 6 1 5688
465204 168 1 1464
465211 88 1 4760

Segmentation and Evaluation

In the literature several approaches have been proposed to determine the number of clusters for k-mean clustering algorithm, for instance: i) Rule of thumb; ii) Elbow method; iii) Information Criterion Approach; iv) Information Theoretic Approach; v) Silhouette method and vi) Cross-validation.

In this case we will use Elbow method to determine the number of clusters. The basic idea of the elbow rule is to use a squareed distance between the sample points in each cluster and the centroid of the cluster to give a series of K values. The sum of squared errors (SSE) is used as a performance indicator. The method iterates over the K-value and calculates the SSE.

# creating a copy of the dataframe
rfm_segmentation = rfm.copy()

# find the proper number of clusters
fig,ax = plt.subplots(figsize=(10,8))
wcss = []
for i in range(1,11):
  kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
  kmeans.fit(rfm_segmentation)
  wcss.append(kmeans.inertia_)
plt.plot(range(1,11), wcss, 'o')
plt.plot(range(1 , 11) , wcss , '-' , alpha = 0.5)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.savefig('Elbow_Method.png')
plt.show()

K means plot1.png

The graph shows, that the curve changes its behaviour from steep to moderate slope at the point of 4 clusters, therefore, we will choose 3 as the proper number of clusters for k-means segmentation. This will be included in the model as a parameter.

# instantiating the model

kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(rfm_segmentation)

# creating a column for the cluster

rfm_segmentation['Cluster'] = kmeans.labels_

# checking the column
rfm_segmentation[rfm_segmentation.Cluster == 0].head()
CustomerID Recency Frequency Monetary Cluster
465132 141 1 7302 0
465164 51 1 578 0
465198 6 1 5688 0
465204 168 1 1464 0
465211 88 1 4760 0

As we applied clustering, we are now able to analyze Recency, Frequency and Monetary values for each group.

Recency rate analysis

fig, ax = plt.subplots(figsize=(8,8))
plt.title('Recency for each Cluster')
sns.boxplot(rfm_segmentation.Cluster,rfm_segmentation.Recency)
plt.savefig("recency_clusters.png")

K means plot2.png

Cluster 0 has high recency rate, respectively, which means it has been the longest for any cluster when it comes to Last Purchase Date.
Cluster 1 and 2 have low recency rate, which is good. They can be our Gold and Silver Customers.

Frequency rate analysis

fig, ax = plt.subplots(figsize=(8,8))
plt.title('Frequency for each Cluster')
sns.boxplot(rfm_segmentation.Cluster,rfm_segmentation.Frequency);
plt.savefig('frequency_cluster.png')

K means plot3.png

Cluster 0 has a low frequency rate, which means consumers in this cluster are not very frequent.
Clusters 1 and 2 have high frequency rates, which puts them even further in the race for Gold and Silver.

Monetary rate analysis

fig, ax = plt.subplots(figsize=(8,8))
plt.title('Monetary for each Cluster')
sns.boxplot(rfm_segmentation.Cluster,rfm_segmentation.Monetary);
plt.savefig('monetary_cluster.png')

K means plot4.png

Cluster 0 has a small monetary rate, which could be referred to Bronze Customers.
Cluster 2 has a medium level monetary rate, which makes this Cluster to be Silver Customers.
Cluster 1 has the highest monetary rate, suggesting this Cluster to be Gold Customers.

Result

We may conclude the following results, based on analyses:

  • Gold Customers are in Cluster 1.
  • Silver Customers are in Cluster 2.
  • Bronze Customers are in Cluster 0.

Visualization

# getting the values
X = rfm_segmentation.iloc[:, [0,1,2]].values

# 2d plot
fig, ax = plt.subplots(figsize=(10,10))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=100, cmap='OrRd')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

K means plot5.png

# checking number of clients per cluster
rfm_segmentation.Cluster.value_counts()
0 179083
2 581
1 16

Strenghts and Drawbacks of k-means Algorithm

Strenghts:
1. Simple.
2. Fast for low dimensional data.
3. It can find pure sub clusters, if large number of clusters is specified.
4. With a large number of variables, k-Means may be viewed as computationally faster than hierarchical clustering.
5. K-Means is expected to produce tighter clusters than hierarchical clustering.

Drawbacks:
1. Sensitive to the selection of initial cluster center.
2. There is no rule for the decision of value of k and sensitive to initial value , for different initial value there will be different result.
3. This algorithm is easy to be effected by abnormal points.
4. It may contain dead unit problem.

References

1. https://www.researchgate.net/publication/313554124_Review_on_Determining_of_Cluster_in_K-means_Clustering

2. https://www.mdpi.com/2571-8800/2/2/16


The author of this entry is Faezeh Moradi. Edited by Evgeniya Zakharova.