You might have heard of ChatGPT or Bard automating various data science and data visualization tasks. In this blog post, we will use ChatGPT-style assistants for a real data science project. A real data science project includes the following steps, each of which can be automated with the help of a language model such as ChatGPT:
Understand the business problem and define the problem statement.
Gather and prepare the data.
Explore and analyze the data.
Build and train machine learning models.
Evaluate the models and deploy them.
ChatGPT can also be used to create documentation for a data science project. However, it is important to keep its limitations in mind: it can be biased, and it can make mistakes while generating code for a data science project. In this blog post, we will be using the Crab Age Dataset from Kaggle. The Crab Age dataset can be downloaded from the following link: https://www.kaggle.com/competitions/playground-series-s3e16
Understanding the business problem
ChatGPT or Bard needs a prompt tailored to each data science task it is asked to generate code for. The very first step is to understand the business problem. In this case, we will be using the following prompt: "Become a data science expert and explain me the business problem for the following dataset: https://www.kaggle.com/competitions/playground-series-s3e16"
Bard gave the following explanation of the business problem:
The business problem for the crab age prediction dataset is to develop a machine learning model that can accurately predict the age of a crab based on a set of features, such as the crab's size, weight, and sex. This information could be used by crabbers to optimize their catch and ensure that they are not catching crabs that are too young. It could also be used by scientists to study crab populations and track their growth over time.
The dataset contains 1,200 rows and 8 columns. The columns are:
crab_id: A unique identifier for each crab
sex: The crab's sex (male or female)
length: The crab's length in centimeters
weight: The crab's weight in grams
width: The crab's width in centimeters
height: The crab's height in centimeters
age: The crab's age in days
shell_condition: The crab's shell condition (good, fair, poor)
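According to the information provided by Bard, the target variable is the "age" column and the goal is to build a model that can accurately predict the age of a crab given the features in the dataset. Bard's description should still be verified against the loaded data: the code later in this post uses the capitalized column names Sex and Age, and the Sex column turns out to have three categories rather than two.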
Exploratory Data Analysis
Exploratory data analysis (EDA) is the very first step when dealing with any data science problem. EDA involves data manipulation, statistical analysis and data visualization. Let's give Bard the following prompt: "Become a data science expert and Please write a Python code to load and perform Exploratory data analysis (EDA) on the crab age dataset"
# loading the dataset and displaying a pandas dataframe
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
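# Load the train and test datasets
train_data = pd.read_csv('/kaggle/input/playground-series-s3e16/train.csv').drop('id', axis=1)
test_data = pd.read_csv('/kaggle/input/playground-series-s3e16/test.csv')
# Keep the test ids for the submission file, then drop them so that
# test_data has the same feature columns as train_data
test_ids = test_data['id']
test_data = test_data.drop('id', axis=1)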
train_data.head()
# Get some basic information about the dataset
train_data.info()
# Descriptive statistics of the numerical features
train_data.describe()
We also want to count the missing values in each column of the dataset. In addition to this, we can generate a correlation matrix of the numerical features.
# Check for missing values
train_data.isnull().sum()
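# Correlation matrix of the numerical features
# (numeric_only=True skips the text-valued Sex column, which would otherwise raise an error in pandas 2.x)
train_data.corr(numeric_only=True)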
# Distribution of the target variable
sns.histplot(train_data["Age"])
According to the histogram above, the age appears to be roughly normally distributed. The most common age among the crabs is about 10 years.
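The visual impression from a histogram can be cross-checked with a few summary numbers. The following is a minimal sketch that assumes a synthetic `age` series as a stand-in for `train_data["Age"]` (the values are generated, not taken from the competition data). It reports the mean, median, mode and skewness; a skewness close to 0 is consistent with a symmetric, bell-shaped distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data["Age"]; swap in the real column in the notebook.
rng = np.random.default_rng(0)
values = np.clip(np.round(rng.normal(loc=10, scale=3, size=1000)), 1, None)
age = pd.Series(values, name="Age")

summary = {
    "mean": age.mean(),
    "median": age.median(),
    "mode": age.mode().iloc[0],  # most frequent value
    "skew": age.skew(),          # near 0 for a symmetric distribution
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

If the mean, median and mode are far apart, or the skewness is clearly different from 0, the "normally distributed" reading of the histogram deserves a second look.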
Feature engineering
We will provide Bard with the following prompt to generate code for feature engineering: "Write a Python code to perform feature engineering". We need to change some of the generated code, since the Sex column in this dataset has three categories (M, F and I) rather than two.
# Create a dictionary to map the sex categories to numerical values
# (note that F and I are deliberately mapped to the same value here)
sex_map = {"M": 1, "F": 0, "I": 0}
# Apply the map function to the sex column
train_data["Sex"] = train_data["Sex"].map(sex_map)
test_data["Sex"] = test_data["Sex"].map(sex_map)
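Mapping F and I to the same number makes the two categories indistinguishable to the model. One common alternative, shown here only as a sketch and not as the approach used in the rest of this post, is one-hot encoding with `pd.get_dummies`. The small `df` below is a hypothetical miniature of the data, with made-up values.

```python
import pandas as pd

# Hypothetical miniature of the dataset with the three Sex categories.
df = pd.DataFrame({
    "Sex": ["M", "F", "I", "F", "M"],
    "Length": [1.5, 1.2, 0.8, 1.3, 1.6],
})

# One indicator column per category, so F and I stay distinguishable.
encoded = pd.get_dummies(df, columns=["Sex"], prefix="Sex", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

When the same encoding is applied to the test set, its columns should be aligned with the training columns, for example with `reindex(columns=encoded.columns, fill_value=0)`, in case a category is missing from one of the files.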
Pre-processing the dataset
This stage involves cleaning the data and handling any imbalance of labels in the dataset. We will provide Bard with the following prompt: "Now write a python code to clean and preprocess the dataset crab age". We will only be using some of the generated code for this tutorial. We will separate the Age column from the training data so that it can serve as the target, and then standardize the features of both the train and test datasets.
from sklearn.preprocessing import StandardScaler
y_train = train_data['Age']
train_data = train_data.drop("Age", axis=1)
scaler = StandardScaler()
# Fit the scaler on the training features only, then reuse it for the test features
scaled_x_train = scaler.fit_transform(train_data)
scaled_x_test = scaler.transform(test_data)
Model selection
Model selection involves choosing an appropriate regression model for predicting the age of the crabs. We can provide the following prompt for model selection: "generate code for using a linear regression model on the crab age dataset"
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(scaled_x_train,y_train)
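Fitting a single model does not yet tell us whether it is a good choice. A minimal sketch of comparing candidates with cross-validation is given below. It uses synthetic `X` and `y` as stand-ins for `scaled_x_train` and `y_train`, and both the random forest candidate and the choice of mean absolute error (MAE) as the metric are assumptions made for illustration, so check the competition page for the metric actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for scaled_x_train / y_train (not the competition data).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 10 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

# Candidate models; the random forest is an illustrative second option.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mae = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MAE is reported as a negative number.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

for name, mae in cv_mae.items():
    print(f"{name}: mean 5-fold CV MAE = {mae:.3f}")
```

In the notebook, `X` and `y` would be replaced by `scaled_x_train` and `y_train`, and the model with the lowest cross-validated error would be the natural one to carry forward.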
Model evaluation
In this stage, we will make predictions for the age of the crabs using the linear regression model and prepare the submission file for Kaggle.
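# Predict the age for every row of the test set
prediction_lr = lr.predict(scaled_x_test)
prediction_lr
# Build the submission table from the test ids and the predictions
submission = pd.DataFrame(data={"id": test_ids, "Age": prediction_lr.flatten()})
# Round the predictions to whole numbers
submission["Age"] = submission["Age"].round().astype(int)
submission.head()
# Write the file to upload to Kaggle
submission.to_csv("submission.csv", index=False)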
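Since the test set has no labels, the quality of the predictions can only be estimated locally by holding out part of the training data. The sketch below does this on synthetic data (a stand-in for the scaled training features and the Age target) and also compares the error before and after rounding the predictions, since the submission above rounds them. MAE is again an assumed metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training features and an integer-valued target.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
y = np.round(10 + X @ np.array([2.0, 1.0, -0.5]) + rng.normal(scale=1.0, size=500))

# Hold out 20% of the labelled data for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
val_pred = model.predict(X_val)

mae_raw = mean_absolute_error(y_val, val_pred)
mae_rounded = mean_absolute_error(y_val, np.round(val_pred))

print(f"validation MAE (raw predictions):     {mae_raw:.3f}")
print(f"validation MAE (rounded predictions): {mae_rounded:.3f}")
```

Whether rounding helps or hurts depends on the data and the metric, so it is worth checking on a validation split before submitting.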
Conclusion
I hope you find this article useful. Bard/ChatGPT are powerful language models that can be used to generate code for data science projects. However, it is important to remember that Bard is not perfect and we should always review the code that it generates before using it. We should also be aware of the limitations of Bard and use it in conjunction with our own knowledge and expertise.
Here are some specific things to keep in mind when using Bard for data science projects:
Bard is still under development, so it may not be able to generate code for all types of data science projects.
Bard may not be able to generate code that is as efficient or as accurate as code that is written by a human.
Bard may generate code that contains errors.