Customer analysis is the process of collecting, analyzing, and interpreting customer data to gain insight into customers' behavior and preferences. The results can be used to improve customer satisfaction and increase sales of products and services. The data for customer analysis can come from several sources, including:
Surveys
Customer Feedback
Transactional Data
Social media data
Python is well suited to this kind of analysis. In this blog post, I will cover the following steps involved in analyzing customer data:
Cleaning and formatting data: Python can be used to clean and format raw data before analysis.
Exploratory data analysis: Python can be used to perform EDA tasks such as plotting graphs and calculating descriptive statistics.
Machine learning: Machine learning algorithms can be used to identify hidden patterns in data and make predictions about customer behavior.
Dataset
We will be using a dataset of customer spending from Kaggle: https://www.kaggle.com/datasets/goyaladi/customer-spending-dataset
This dataset contains the following columns:
Customer ID: A unique identifier for each customer.
Gender: The customer's gender.
Name: The customer's name.
Age: The customer's age.
Education: The customer's highest level of education.
Income ($): The customer's annual income.
Spending: The customer's annual spending.
Country: The customer's country of origin.
Customer Purchase: The frequency of the customer's purchases (the purchase_frequency column in the code below).
This dataset can be used to analyze customer demographics, income, spending habits, and purchase behavior. It can be used to segment customers into different groups, identify customer pain points, and develop marketing campaigns that target specific customer segments.
Exploratory Data Analysis
The very first step in analyzing customer data is to import the essential libraries in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LinearRegression
import plotly.express as px
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
The next step is to read the data from a CSV file and inspect it:
df = pd.read_csv("/kaggle/input/customer-spending-dataset/customer_data.csv")
df.info()
df.describe()
An essential step in data analysis is checking a dataset for null values.
df.isnull().sum()
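The null check reports how many values are missing in each column; what to do about them is a separate decision. Here is a minimal sketch on a small made-up frame. The values, the duplicate row, and the choice of median imputation are all assumptions for illustration, not steps taken on the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset used in this post.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "age": [34, 51, 51, np.nan],
    "income": [52000.0, 48000.0, 48000.0, np.nan],
})

# Count missing values per column, as df.isnull().sum() does.
missing = toy.isnull().sum()

# Drop exact duplicate rows, then fill numeric gaps with the column median.
clean = toy.drop_duplicates().copy()
for col in ["age", "income"]:
    clean[col] = clean[col].fillna(clean[col].median())

print(missing)
print(clean)
```

Median imputation is only one option. Dropping incomplete rows with dropna() may be the better choice when only a handful of rows are affected.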
Data Visualization
Seaborn is a Python library commonly used for data visualizations. A Seaborn count plot is a type of bar graph that is used to visualize the distribution of categorical data. It shows the number of observations in each category of a categorical variable.
Let's use a seaborn count plot on two categorical columns in this dataset, gender and education.
sns.countplot(data=df, x='gender')
plt.show()
sns.countplot(data=df, x='education')
plt.show()
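The numbers behind a count plot can also be read directly with value_counts(). The sketch below uses a made-up Series in place of df['education'], so the category values are assumptions:

```python
import pandas as pd

# Hypothetical toy column standing in for df['education'].
education = pd.Series(
    ["Bachelor", "Master", "Bachelor", "PhD", "High School", "Bachelor"],
    name="education",
)

# Absolute number of observations per category (what the count plot draws).
counts = education.value_counts()

# The same information as a fraction of all observations.
shares = education.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Having the exact counts is handy when two bars look almost equal in the plot.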
Let's use a box plot to explore how income is distributed within each education level and to spot outliers:
sns.boxplot(x='education', y='income', data=df)
plt.title('Income Distribution by Education')
plt.xlabel('Education')
plt.ylabel('Income')
plt.show()
The box plot above shows the income distribution for each education level. Let's use a bar plot to compare the average income across education levels.
plt.figure(figsize=(10,5))
sns.barplot(x='education', y='income', data=df)
plt.title('Average Income by Education')
plt.xlabel('Education')
plt.ylabel('Average Income')
plt.show()
The bar plot above shows that people with Master's and Bachelor's degrees tend to have the highest income. Next, let's explore the relationship between gender and income with a bar plot of the two variables.
plt.figure(figsize=(10,5))
sns.barplot(x='gender', y='income', data=df)
plt.title('Average Income by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Income')
plt.show()
The bar graph above shows that, in this dataset, male customers tend to have a higher income than female customers.
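By default, sns.barplot draws the mean of the y variable for each category, so the bar heights can be cross-checked with a groupby. This is a minimal sketch on made-up rows; the income figures are assumptions and do not come from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset used in this post.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "education": ["Master", "Master", "Bachelor", "Bachelor"],
    "income": [60000, 58000, 50000, 52000],
})

# Mean income per education level (the height of each bar in a bar plot).
mean_by_education = toy.groupby("education")["income"].mean()

# Mean income per gender, with the group size for context.
mean_by_gender = toy.groupby("gender")["income"].agg(["mean", "count"])

print(mean_by_education)
print(mean_by_gender)
```

Looking at the group sizes next to the means helps judge whether a difference between bars rests on enough observations to be meaningful.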
Let's try using a scatter plot to explore the relationship between different variables:
plt.figure(1, figsize=(15, 6))
# Plot each gender as its own series so it gets a separate color and legend entry
for gender in ['Male', 'Female']:
    plt.scatter(x='age', y='income', data=df[df['gender'] == gender],
                s=200, alpha=0.5, label=gender)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs Income w.r.t Gender')
plt.legend()
plt.show()
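Another data visualization technique is the correlation heatmap. Correlation analysis shows which pairs of numerical variables are positively or negatively correlated, and how strongly.
# numeric_only=True skips text columns such as name and country (needed in pandas 2.0 and later)
sns.heatmap(df.corr(numeric_only=True), annot=True)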
As seen in the correlation heatmap above, variables like customer spending and purchase frequency are highly correlated.
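A heatmap is a picture of a correlation matrix, and the coefficient for a single pair of variables can be pulled out of that matrix directly. The sketch below uses made-up values, so the resulting coefficient says nothing about the real dataset. It also assumes pandas 1.5 or later for the numeric_only argument.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset used in this post.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "purchase_frequency": [0.1, 0.3, 0.5, 0.7, 0.9],
    "spending": [1200, 2900, 5100, 6800, 9100],
    "country": ["US", "DE", "US", "FR", "DE"],
})

# Pearson correlation matrix over the numeric columns only.
corr = toy.corr(numeric_only=True)

# Coefficient for one pair of variables.
r = corr.loc["purchase_frequency", "spending"]

print(corr.round(3))
print(f"purchase_frequency vs spending: r = {r:.3f}")
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.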
We can also analyze the relationship between purchase frequency and spending variables using a box plot.
sns.boxplot(x=df.purchase_frequency,y=df.spending)
plt.title("Distribution of spending by purchase frequency")
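With seaborn's default settings, a box plot draws points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied by hand. This is a sketch on a made-up spending column, so the values are assumptions:

```python
import pandas as pd

# Hypothetical spending values with one unusually large entry.
spending = pd.Series(
    [1200, 1500, 1700, 1800, 2000, 2100, 2300, 9000], name="spending"
)

# Quartiles and interquartile range.
q1 = spending.quantile(0.25)
q3 = spending.quantile(0.75)
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = spending[(spending < lower) | (spending > upper)]
print(outliers)
```

Flagged values are candidates for a closer look, not automatic deletions. A very high spender may be a genuine and valuable customer.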
Feature engineering and Machine Learning Analysis
We have to do some feature engineering before we can use any machine learning algorithm on the dataset. We will be removing any unnecessary columns and also separating categorical variables from numerical variables.
To transform categorical variables into numerical ones, we'll utilize one-hot encoding.
df_preprocess = df.join(pd.get_dummies(df.education))
df_preprocess = df_preprocess.drop('education',axis=1)
df_preprocess = df_preprocess.join(pd.get_dummies(df.gender))
df_preprocess = df_preprocess.drop('gender',axis=1)
df_preprocess = df_preprocess.join(pd.get_dummies(df.country))
df_preprocess = df_preprocess.drop('country',axis=1)
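To see what one-hot encoding produces, it helps to run it on a few rows. The sketch below uses a made-up frame and the columns argument of pd.get_dummies, which encodes several columns in one call. Note that this variant prefixes the new column names (for example education_Bachelor), unlike the join-based code in this post, so it is an alternative illustration and not a drop-in replacement.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset used in this post.
toy = pd.DataFrame({
    "age": [25, 40, 33],
    "education": ["Bachelor", "PhD", "Bachelor"],
    "gender": ["Female", "Male", "Male"],
})

# Each category becomes its own 0/1 column; the original columns are dropped.
# dtype=int yields integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["education", "gender"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the columns derived from the same original variable, which is where the name "one-hot" comes from.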
The next step is to rescale some of the numerical variables to the [0, 1] range using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_preprocess['income'] = scaler.fit_transform(df_preprocess.income.values.reshape(-1,1))
df_preprocess['age'] = scaler.fit_transform(df_preprocess.age.values.reshape(-1,1))
df_preprocess['spending'] = scaler.fit_transform(df_preprocess.spending.values.reshape(-1,1))
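Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. The sketch below checks this on made-up numbers and scales two columns in one call; the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values; column names mirror the dataset used in this post.
toy = pd.DataFrame({
    "income": [20000.0, 50000.0, 80000.0],
    "age": [20.0, 35.0, 60.0],
})

# Scale both columns at once; each column is scaled independently.
scaled = MinMaxScaler().fit_transform(toy[["income", "age"]])

# The same transformation written out by hand.
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```

Bringing the columns onto a shared scale matters for PCA, because otherwise a variable measured in large units such as income would dominate the components.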
Once we have converted the categorical variables into numerical ones and rescaled the numerical columns, we can apply principal component analysis (PCA), an unsupervised technique for dimensionality reduction and visualization. It can be used to simplify complex datasets and to identify the underlying patterns in the data. A scatter plot of the first two principal components is a useful way to understand the data and to spot clusters of data points. Let's apply PCA to this dataset and visualize the result using a scatter plot.
from sklearn.decomposition import PCA
df_num = df_preprocess[['age','income','purchase_frequency', 'spending',
'Female', 'Male', 'Bachelor', 'High School', 'Master', 'PhD']]
pca = PCA(n_components=2)
pca_df2 = pd.DataFrame(pca.fit_transform(df_num))
pca.explained_variance_
sns.set(rc = {"figure.figsize":(5,5)})
sns.scatterplot(x=pca_df2.iloc[:,0], y=pca_df2.iloc[:,1])
plt.title('PC1 against PC2')
A scatter plot from a PCA can tell us a lot about the data. It can show us the relationships between the different variables in the data, as well as the clusters or groups of data points.
The first principal component (PC1) is the direction that accounts for the most variance in the data. The second principal component (PC2) is the direction, orthogonal to PC1, that accounts for the second most variance, and so on. The scatter plot shows the data points projected onto the PC1 and PC2 axes.
The loadings of PC1, that is, the weights each original variable receives in that component, show which variables vary together. Variables whose PC1 loadings have the same sign tend to increase together along the PC1 axis, while variables whose loadings have opposite signs tend to move in opposite directions.
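Both ideas, the share of variance per component and the per-variable loadings, can be read off a fitted PCA object through its explained_variance_ratio_ and components_ attributes. The sketch below uses synthetic data in which two columns are built to move together. The data, the random seed, and the column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data: spending and purchase_frequency share a common signal,
# age is independent of both.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "spending": base + 0.1 * rng.normal(size=n),
    "purchase_frequency": base + 0.1 * rng.normal(size=n),
    "age": rng.normal(size=n),
})

pca = PCA(n_components=2).fit(toy)

# Fraction of the total variance captured by each component.
ratio = pca.explained_variance_ratio_

# Loadings: one row per original variable, one column per component.
loadings = pd.DataFrame(
    pca.components_.T, index=toy.columns, columns=["PC1", "PC2"]
)

print(ratio.round(3))
print(loadings.round(3))
```

In this synthetic example, spending and purchase_frequency receive PC1 loadings of the same sign because they were constructed to vary together. The overall sign of a component is arbitrary, so only the signs relative to each other carry meaning.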
According to the scatter plot above, there are eight distinct clusters.
Thank you for reading this article. I hope you found it useful.