You might have come across unsupervised machine learning algorithms while studying AI and machine learning. One of the simplest unsupervised algorithms used in real-life applications is k-means clustering. K-means clustering divides a set of data points into k clusters. The algorithm assigns each data point to the cluster with the closest centroid and then updates each centroid to be the mean of the data points in its cluster. These two steps repeat until the cluster assignments no longer change.
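To make the assign-and-update loop concrete, here is a minimal NumPy sketch of it. The function name `kmeans_sketch`, the random choice of starting centroids, and the toy points are assumptions made for illustration only; this is not how scikit-learn implements the algorithm internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two well-separated groups of 2-D points.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                  [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
```

On this toy input the first three points should end up in one cluster and the last three in the other; which cluster gets the label 0 depends on the random start.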
K-means clustering is easy to understand and implement, and the scikit-learn Python library provides a very simple implementation of it. In this blog post, we will apply k-means clustering to the mall customer dataset from Kaggle: https://www.kaggle.com/code/krishnaraj30/clustering-segmentation-k-means-clusters
Dataset used
This dataset was created for learning purposes, to help understand customer segmentation concepts. We will demonstrate this using an unsupervised machine learning technique, the K-means clustering algorithm.
By the end of this analysis, you will learn:
- How to achieve customer segmentation using the K-means clustering algorithm in Python in a simple way.
- Who your target customers are, so you can start planning marketing strategies.
This dataset doesn't require extensive data cleaning, so we will skip the data cleaning steps. Let's move on to checking some columns in the dataset.
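If you want to confirm for yourself that little cleaning is needed, a quick check for missing values and duplicated rows is usually enough. The sketch below is only an illustration: the helper name `quick_quality_report` and the small stand-in DataFrame are hypothetical, and in practice you would pass the DataFrame loaded from the downloaded file.

```python
import pandas as pd

def quick_quality_report(df):
    """Return per-column missing counts and the number of duplicated rows."""
    missing_per_column = df.isna().sum()
    duplicate_rows = int(df.duplicated().sum())
    return missing_per_column, duplicate_rows

# Hypothetical stand-in for the real data (values are made up).
toy_df = pd.DataFrame({
    "education": ["Bachelor", "PhD", "Master"],
    "income": [40000, 85000, 62000],
    "spending": [300, 1200, 700],
})
missing, duplicates = quick_quality_report(toy_df)
print(missing)
print("duplicated rows:", duplicates)
```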
K-means clustering algorithm
The very first step is to check the unique values in the education column of the dataset.
# df is assumed to be the dataset already loaded as a pandas DataFrame
df.education.unique()
The education column holds categories that need to be converted into numerical values. The code below converts the education categories into integers:
# make a new data frame, df2, with the education levels encoded as digits
df2 = df.replace(['High School', 'Bachelor', 'Master', 'PhD'], ['1', '2', '3', '4'])
# in this analysis we don't need the country column
df2 = df2.drop('country', axis=1)
# change the education column from object to int
df2["education"] = df2["education"].astype(int)
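An alternative way to do the same encoding is an explicit dictionary passed to `Series.map`, which touches only the education column and produces integers directly. This is a sketch under assumptions: the stand-in rows and the names `EDUCATION_ORDER`, `toy_df`, and `encoded` are hypothetical, and only the four categories and their codes come from the code above.

```python
import pandas as pd

# Ordered mapping from education level to an integer rank
# (same categories and codes as in the post).
EDUCATION_ORDER = {"High School": 1, "Bachelor": 2, "Master": 3, "PhD": 4}

# Hypothetical stand-in rows; the real df has more columns.
toy_df = pd.DataFrame({
    "education": ["Bachelor", "PhD", "High School", "Master"],
    "country": ["A", "B", "A", "C"],
})

encoded = toy_df.drop(columns="country").copy()
encoded["education"] = encoded["education"].map(EDUCATION_ORDER)

# map() leaves NaN for any category missing from the dictionary,
# so an unexpected label is easy to detect before casting to int.
assert encoded["education"].notna().all()
encoded["education"] = encoded["education"].astype(int)
print(encoded)
```

Unlike a frame-wide `replace`, this cannot accidentally rewrite a matching string in some other column.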
We will use only numerical columns from the dataset for k-means clustering. The goal is to get an idea of how many customer segments with similar behavior patterns exist in the dataset.
# select the income and spending columns by name (positions 4 and 7 in the original df)
X = df2[['income', 'spending']].values
We will use the k-means clustering implementation from scikit-learn and test it for values of k ranging from 1 to 10.
# analyze segmentation
# find the best number of clusters k
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []  # within-cluster sum of squares for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.show()
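The WCSS values collected in the loop come from scikit-learn's `inertia_` attribute, which is the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes that sum by hand on synthetic data to show what the plotted quantity measures. The data, the variable names, and the explicit `n_init=10` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the income/spending matrix X (values are made up).
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[20, 20], scale=2.0, size=(30, 2)),
    rng.normal(loc=[60, 80], scale=2.0, size=(30, 2)),
    rng.normal(loc=[100, 30], scale=2.0, size=(30, 2)),
])

model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X_toy)

# WCSS: squared distance from each point to the centroid of its own cluster, summed.
assigned_centroids = model.cluster_centers_[model.labels_]
wcss_manual = float(((X_toy - assigned_centroids) ** 2).sum())

print(wcss_manual, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding. Because adding clusters can only shrink these distances, WCSS keeps falling as k grows, which is why we look for the bend in the curve rather than its minimum.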
The plot shows that the elbow of the curve is at k=3, which suggests that the optimum number of clusters for this dataset is 3. In other words, there are 3 kinds of customers in the dataset. Let's segment the dataset using the number of clusters indicated by the elbow method.
# apply k-means clustering to the dataset
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
# add the segment labels to the dataframe
df2['segmentation'] = y_kmeans
# income and spending by segmentation
df2.groupby('segmentation').agg(
    min_income=('income', 'min'),
    max_income=('income', 'max'),
    avg_income=('income', 'mean'),
    min_spending=('spending', 'min'),
    max_spending=('spending', 'max'),
    avg_spending=('spending', 'mean'))
We have grouped the dataset by the segmentation column and aggregated the income and spending values. Each of the 3 customer groups is characterized by its average income and spending.
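One way to sanity-check the per-segment averages is to compare them with the fitted centroids: once k-means has fully converged, each centroid is the mean of the points assigned to it, so the two should match. The sketch below shows this on synthetic data. The column names follow the post, while the values, the variable names, and the explicit `n_init=10` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data; the values are made up.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    np.vstack([
        rng.normal(loc=[30, 70], scale=3.0, size=(40, 2)),
        rng.normal(loc=[60, 40], scale=3.0, size=(40, 2)),
        rng.normal(loc=[90, 80], scale=3.0, size=(40, 2)),
    ]),
    columns=["income", "spending"],
)

model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
toy["segmentation"] = model.fit_predict(toy[["income", "spending"]].values)

# Per-segment averages, as in the groupby/agg call of the post.
segment_means = toy.groupby("segmentation")[["income", "spending"]].mean()

# Centroids in the same row order (segment 0, 1, 2) and column order.
centroids = pd.DataFrame(model.cluster_centers_, columns=["income", "spending"])

print(segment_means)
print(centroids)
print(toy["segmentation"].value_counts().sort_index())
```

The last line also prints how many customers fall into each segment, which is useful context when deciding which group to target first.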
I hope you found this article useful. I will be sharing similar articles on unsupervised machine-learning algorithms in the future.