Exploratory data analysis of a medical cost prediction dataset on Kaggle

AI is changing healthcare in many ways. Some of the most promising applications are diagnosing diseases by analyzing large datasets, developing personalized risk models, monitoring health data, and providing decision support tools. The use of AI in medical cost prediction is still in its early stages, but it has the potential to change the way we handle healthcare expenses. Some of the advantages of using AI for medical cost prediction are the following:

Increased transparency in healthcare costs. By making it easier to predict future medical costs, AI can help patients and providers understand the true cost of care. This can lead to more informed decision-making and a reduction in the overall cost of healthcare.
Improved risk management. By identifying patients who are at risk of high medical costs, AI can help providers and payers take steps to mitigate these risks. This can lead to a reduction in the number of surprise medical bills and a more equitable distribution of healthcare costs.
Enhanced patient care. By providing personalized medical cost predictions, AI can help patients make better decisions about their health care. This can lead to improved outcomes and a better quality of life for patients.

Let's take a look at a dataset from Kaggle. It is a medical insurance dataset that can be used to practice basic exploratory data analysis and basic machine learning model building. You can download it from Kaggle: https://www.kaggle.com/datasets/mirichoi0218/insurance. You should be familiar with data science libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn before following this tutorial.

Basic exploratory data analysis of the dataset

The Medical Cost Personal Dataset is a dataset that contains information about the medical costs of a sample of individuals in the United States. The dataset includes the following columns:

age: The age of the primary beneficiary.
sex: The gender of the primary beneficiary.
bmi: The body mass index of the primary beneficiary.
children: The number of children covered by health insurance.
smoker: Whether the primary beneficiary smokes.
region: The region of the United States where the primary beneficiary lives.
charges: The individual medical costs billed by health insurance.

The first step is to explore the first few rows of the dataset.

import pandas as pd

# Load the dataset and preview the first five rows
dataset = pd.read_csv("/kaggle/input/insurance/insurance.csv")
dataset.head(5)

The next step is to inspect the columns of the dataset and find out which are numerical and which are categorical.

dataset.info()

One important step in exploratory data analysis is to check whether there are any null values in the dataset. The check below returns False for this dataset, so no null values are present.

dataset.isnull().values.any()
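
The check above gives a single yes-or-no answer. When a dataset does contain missing values, a per-column count shows where they are. The following is a minimal sketch on a made-up toy frame (the rows are invented for illustration and are not taken from the insurance data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing bmi value, for illustration only
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, np.nan, 30.1],
    "smoker": ["yes", "no", "no"],
})

# Single True/False answer: is anything missing anywhere?
has_missing = toy.isnull().values.any()

# Per-column count of missing values
missing_per_column = toy.isnull().sum()

print(has_missing)
print(missing_per_column)
```

On the real data, the same call would be dataset.isnull().sum().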

We also need to check for outliers in the dataset. The Python code below flags values of charges that lie more than three standard deviations from the mean, then plots the distribution with both bounds marked.

import matplotlib.pyplot as plt

# Calculate the mean and standard deviation of the charges variable
mean = dataset["charges"].mean()
std = dataset["charges"].std()

# Define the upper and lower bounds of the outliers
upper_bound = mean + 3 * std
lower_bound = mean - 3 * std

# Identify the outliers on either side of the bounds
outliers = dataset[(dataset["charges"] > upper_bound) | (dataset["charges"] < lower_bound)]

# Plot the distribution of the charges variable with the bounds in red
plt.hist(dataset["charges"])
plt.axvline(upper_bound, color="red")
plt.axvline(lower_bound, color="red")
plt.show()
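
The three-standard-deviation rule works best when the variable is roughly symmetric, because the mean and standard deviation are themselves pulled by extreme values. Cost data is often right-skewed, so a quartile-based rule can be a useful cross-check. The sketch below uses the common 1.5 × IQR convention; the helper name iqr_outliers and the toy values are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up charges for illustration; not taken from the insurance dataset
toy_charges = pd.Series(
    [1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 2700.0, 3000.0, 60000.0]
)
flagged = iqr_outliers(toy_charges)
print(flagged)
```

On the real data, the call would be iqr_outliers(dataset["charges"]). Whether flagged rows should be dropped is a separate decision: high charges may be genuine and informative rather than errors.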

The next step is to one-hot encode the categorical variables. The following code does this, dropping the first level of each categorical variable to avoid redundant columns.

dataset = pd.get_dummies(data=dataset, drop_first=True)
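
To see what this call does, here is a small sketch on a made-up three-row frame (invented values, using column names from the dataset). Each text column is replaced by indicator columns, and drop_first=True removes the first category of each, which is then implied when all remaining indicators are zero.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
})

encoded = pd.get_dummies(data=toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Here smoker becomes a single smoker_yes column, and region becomes region_southeast and region_southwest, with northwest as the dropped baseline. Recent pandas versions return the indicator columns as booleans; passing dtype=int to get_dummies yields 0/1 integers instead, if preferred.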

We can also run a correlation analysis on this dataset. A correlation diagram can help us identify the relationships between different variables. The strength and direction of the linear relationship between two variables is given by the correlation coefficient, which is a number between -1 and 1. A correlation coefficient of 1 indicates a perfect positive correlation, meaning that as one variable increases, the other variable also increases. A correlation coefficient of -1 indicates a perfect negative correlation, meaning that as one variable increases, the other variable decreases. A coefficient near 0 indicates little or no linear relationship.
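
To make the coefficient concrete, the sketch below computes the Pearson correlation by hand with NumPy on made-up numbers. The helper name pearson is hypothetical; pandas computes the same quantity by default in DataFrame.corr().

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson(x, 2 * x + 1)      # perfectly increasing linear relationship
r_down = pearson(x, -3 * x + 10)  # perfectly decreasing linear relationship
print(r_up, r_down)
```

The first value comes out at (approximately) 1 and the second at (approximately) -1, matching the two extremes described above.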

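import seaborn as sns

# Pairwise correlations between all (now numeric) columns, drawn as a heatmap
corr = dataset.corr()
plt.figure(figsize=(20,10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()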
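
The snippets that follow use x_train, x_test, y_train, and y_test, which have to be created by splitting the data first. The sketch below shows one common way to do that with scikit-learn. The 80/20 split and random_state=0 are assumptions, not settings taken from the original analysis, and the toy frame is made up so the example runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-encoded rows standing in for the real dataset (illustration only)
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 27, 61, 38, 24, 49, 30],
    "bmi": [27.9, 22.7, 30.1, 33.0, 25.4, 29.8, 31.2, 24.6, 28.3, 26.1],
    "smoker_yes": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    "charges": [16884.9, 4449.5, 8240.6, 39611.8, 3756.6,
                13228.8, 38711.0, 2464.6, 9630.4, 19964.7],
})

# Features are every column except the target; the target is charges
x = toy.drop(columns="charges")
y = toy["charges"]

# Assumed settings: 20% held out for testing, fixed seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

With the real data, toy would be replaced by the encoded dataset from the previous step.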

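After splitting the data into training and test sets, we need to make sure that their shapes are consistent: x_train and x_test should have the same number of columns, and each feature set should have as many rows as its target. The code snippet below prints the shapes:

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)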

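Many machine learning algorithms, such as regularized regression, k-nearest neighbors, and models trained with gradient descent, are sensitive to the scale of the features. Ordinary least squares linear regression makes the same predictions with or without scaling, but standardized features make its coefficients easier to compare and keep the pipeline ready for scale-sensitive models. Standard scaling puts all features on a similar scale. The following code snippet fits the scaler on the training data only and then applies the same transformation to the test data.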

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

With the data prepared, the next step is to train a linear regression model on the training data and evaluate it on the test data. The code snippet below trains the model and reports its R-squared score on the test set.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

regressor_lr = LinearRegression()
regressor_lr.fit(x_train, y_train)
y_pred = regressor_lr.predict(x_test)
r2_score(y_test, y_pred)
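
R-squared is unitless, so it can help to report error metrics in the units of the target as well. The sketch below computes mean absolute error and root mean squared error alongside R-squared. It uses synthetic data generated on the spot (an assumption made so the example is self-contained), not the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data for illustration: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.5, size=200)

# Simple holdout: first 150 rows for training, last 50 for testing
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R^2={r2:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```

On the insurance data, MAE and RMSE would be expressed in the same units as charges, which makes them easier to interpret than R-squared alone.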

In this blog post, we learned how to perform basic exploratory data analysis on the medical cost prediction dataset and also created a linear regression model and trained and tested it on the dataset.
