Exploratory data analysis of titanic machine learning disaster dataset on Kaggle


The Titanic machine learning disaster dataset is a collection of data about the passengers on board the Titanic, including their names, ages, genders, socio-economic class, and whether or not they survived the sinking. The dataset is used to train machine learning models to predict whether a passenger would have survived the Titanic disaster.

The dataset is a valuable resource for machine learning practitioners and researchers. It is a challenging dataset, as there are many factors that could have influenced a passenger's survival, such as their age, gender, class, and location on the ship. However, the dataset is also well-structured and relatively clean, making it a good fit for machine learning algorithms.

In this blog post, we will do some exploratory data analysis using the pandas library. The very first step is to import the required Python libraries and then load the train and test datasets on Kaggle.

# Loading the basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# Load the train and test datasets (standard Kaggle input paths)
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

The next step is to get an idea about our dataset. The following line of code displays the first four rows:

train.head(4)
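Beyond `head`, two other quick checks are `shape` and `describe`. A minimal sketch, using a small stand-in DataFrame (hypothetical data, not the real Kaggle file) so the calls can be shown in isolation:

```python
import pandas as pd

# Stand-in DataFrame with Titanic-style columns (hypothetical data)
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

print(train.head(4))     # first four rows
print(train.shape)       # (rows, columns)
print(train.describe())  # summary statistics for numeric columns
```

`shape` confirms how many rows and columns were loaded, and `describe` gives a first look at ranges and means before any cleaning.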

The next step is to examine the correlation across the numeric features in the dataset. The following lines draw an annotated correlation heatmap:

cor = sns.heatmap(train.corr(numeric_only=True), annot=True, cmap='RdYlGn', linewidths=0.2)
figure = cor.get_figure()

In data science, it is important to perform data cleaning before we can use a dataset to train a model for making predictions. One simple approach is to replace missing (NaN) values with a placeholder value such as zero:

# Replace null values with zero. Note: with inplace=True, fillna
# returns None, so do not assign the result back to train.
train.fillna(0, inplace=True)
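It is worth counting the missing values before and after the fill to confirm the cleaning worked. A minimal sketch, using a small stand-in DataFrame (hypothetical data) rather than the real Kaggle file:

```python
import numpy as np
import pandas as pd

# Stand-in frame with some missing values (hypothetical data)
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, np.nan],
})

print(train.isnull().sum())  # missing values per column before filling
train = train.fillna(0)      # replace NaN with zero
print(train.isnull().sum())  # should now be zero everywhere
```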

We need to check which variables in the dataset are categorical and which are numerical, using the following line of code:

train.info()

  • Categorical variables: Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp and Parch

  • Numerical variables: Fare, Age and PassengerId
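One way to verify this split programmatically is `select_dtypes`. A short sketch with a stand-in DataFrame (hypothetical data): note that a column like Survived is *stored* as a number even though we *treat* it as categorical, so dtype alone does not settle the question.

```python
import pandas as pd

# Stand-in frame mixing numeric and string columns (hypothetical data)
train = pd.DataFrame({
    "Survived": [0, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Embarked": ["S", "C"],
})

# Split columns by their storage dtype
numeric_cols = train.select_dtypes(include="number").columns.tolist()
object_cols = train.select_dtypes(include="object").columns.tolist()
print(numeric_cols)  # ['Survived', 'Age']
print(object_cols)   # ['Sex', 'Embarked']
```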

We also need to check for outliers. The following function uses the interquartile range (IQR) to flag rows that are outliers in more than two features:

from collections import Counter

def detect_outliers(df, features):
    outlier_indices = []

    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c], 25)
        # 3rd quartile
        Q3 = np.percentile(df[c], 75)
        # Interquartile range
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # Store indices
        outlier_indices.extend(outlier_list_col)

    # Keep only rows flagged as outliers in more than two features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]

    return multiple_outliers
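To show the function in action, here is a self-contained sketch with a small made-up DataFrame in which one row is extreme in three features at once; the helper is the same IQR-based detector defined above, repeated so the snippet runs on its own:

```python
from collections import Counter
import numpy as np
import pandas as pd

def detect_outliers(df, features):
    # Same IQR-based detector as defined above
    outlier_indices = []
    for c in features:
        Q1 = np.percentile(df[c], 25)
        Q3 = np.percentile(df[c], 75)
        step = (Q3 - Q1) * 1.5
        outlier_indices.extend(
            df[(df[c] < Q1 - step) | (df[c] > Q3 + step)].index)
    counts = Counter(outlier_indices)
    return [i for i, v in counts.items() if v > 2]

# Hypothetical data: the last row is extreme in all three features
train = pd.DataFrame({
    "Age":   [22, 25, 24, 26, 23, 80],
    "Fare":  [7, 8, 9, 7, 8, 500],
    "SibSp": [0, 1, 0, 1, 0, 8],
})

outliers = detect_outliers(train, ["Age", "Fare", "SibSp"])
print(outliers)  # index 5 is flagged in more than two features
train = train.drop(outliers).reset_index(drop=True)
```

Dropping only rows that are outliers in several features at once avoids discarding passengers who are unusual in a single column, such as a legitimately high fare.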

In the next blog post, we will do some feature engineering and model training on the Titanic machine learning dataset.
