Exploratory data analysis of the Titanic machine learning disaster dataset on Kaggle
The Titanic machine learning disaster dataset is a collection of data about the passengers on board the Titanic, including their names, ages, genders, socio-economic class, and whether or not they survived the sinking. The dataset is used to train machine learning models to predict whether a passenger would have survived the Titanic disaster.
The dataset is a valuable resource for machine learning practitioners and researchers. It is a challenging dataset, as there are many factors that could have influenced a passenger's survival, such as their age, gender, class, and location on the ship. However, the dataset is also well-structured and relatively clean, making it a good fit for machine learning algorithms.
In this blog post, we will do some exploratory data analysis using the pandas library. The very first step is to import the required Python libraries and then load the train and test datasets on Kaggle.
# Loading the basic libraries
import os

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
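The later snippets reference a train DataFrame, so we also need to read the CSV files. Below is a minimal sketch assuming the standard Kaggle input path /kaggle/input/titanic/; adjust the paths if you run the notebook elsewhere.
# Read the competition files (the path is an assumption based on the standard Kaggle layout)
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')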
The next step is to get an idea of what our dataset looks like. The following line of code displays the first few rows of the dataset.
train.head(4)
Next, we want to see how the numeric features in the dataset correlate with each other. A seaborn heatmap of the correlation matrix works well for this purpose.
# Heatmap of pairwise correlations between the numeric columns
cor = sns.heatmap(train.corr(numeric_only=True), annot=True, cmap='RdYlGn', linewidths=0.2)
figure = cor.get_figure()
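The get_figure() call returns the underlying Matplotlib figure, which is useful if you want to save the heatmap to disk. A small sketch of that (the filename here is just an example):
# Save the heatmap and display it
figure.savefig('correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()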
In data science, it is important to clean the data before using a dataset to train a predictive model. One common step is handling missing (NaN) values, either by dropping them or by filling them with a placeholder value. The following line of code fills every missing value with zero.
# Replace null values with zero
train = train.fillna(0)
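Before filling anything, it is worth checking how many values are actually missing in each column; in the Titanic train set most of the gaps are in Age, Cabin, and Embarked. A quick sketch:
# Count missing values per column, largest first
print(train.isnull().sum().sort_values(ascending=False))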
We also need to check which variables in the dataset are categorical and which are numerical, using the following line of code.
train.info()
Categorical variables: Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp and Parch
Numerical variables: Fare, Age and PassengerId
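If you prefer to derive these lists programmatically rather than reading them off train.info(), pandas can split columns by dtype. A rough sketch (note that integer-coded categoricals such as Survived and Pclass will land in the numeric list and need to be reclassified by hand):
# Split columns by dtype; numerically encoded categoricals still need manual review
numerical_cols = train.select_dtypes(include=['number']).columns.tolist()
categorical_cols = train.select_dtypes(include=['object']).columns.tolist()
print('Numerical:', numerical_cols)
print('Categorical:', categorical_cols)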
We also need to check for outliers. The following function uses the interquartile range (IQR) rule and returns the rows that are outliers in more than two of the given features.
from collections import Counter

def detect_outliers(df, features):
    outlier_indices = []
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c], 25)
        # 3rd quartile
        Q3 = np.percentile(df[c], 75)
        # Interquartile range
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # Store indices
        outlier_indices.extend(outlier_list_col)
    # Keep only rows that are outliers in more than two features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    return multiple_outliers
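A sketch of how the function might be used on the numeric columns; which columns to pass, and whether to drop the flagged rows, is a modelling choice rather than something fixed by the dataset.
# Find rows that are outliers in several numeric features and drop them
outlier_rows = detect_outliers(train, ['Age', 'SibSp', 'Parch', 'Fare'])
train = train.drop(outlier_rows, axis=0).reset_index(drop=True)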
In the next blog post, we will do some feature engineering and model training on the Titanic machine learning dataset.