A beginner's guide to scikit-learn

Scikit-learn is an open-source machine-learning library for Python, used by machine-learning practitioners and data scientists. It offers an easy-to-use, well-documented interface for implementing common supervised and unsupervised machine learning algorithms. This guide covers some basic supervised algorithms.

Supervised machine learning algorithms

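Supervised machine learning algorithms learn from labelled data: each training example pairs a set of features with a known label, and the trained model predicts the label for new examples. The label can be a category (classification) or a continuous value (regression).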

Some of the most common supervised machine learning algorithms include:

  • Linear regression: This algorithm is used to predict a continuous value.

  • Logistic regression: This algorithm is used to predict a binary outcome, such as yes or no. Scikit-learn's implementation also handles more than two classes.

  • K-nearest neighbours: This algorithm is used to predict the class of a new data point based on the k most similar data points in the training set.

  • Support vector machines: This algorithm classifies data points into two or more classes by finding the boundary that separates the classes with the widest margin.

  • Random forests: This algorithm combines the predictions of many decision trees. It is used here to classify data points into two or more classes, and it can also be used for regression.
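
The four classifiers in this list share the same fit/predict interface in scikit-learn, so they can be swapped for one another with very little code. The following is a minimal sketch of that idea, assuming synthetic two-class data from make_classification rather than a real dataset; the dictionary keys and the 5-fold setting are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic two-class data (an assumption, used only for illustration)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

classifiers = {
    "logistic regression": LogisticRegression(),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Every estimator exposes the same interface, so one loop handles them all
results = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)  # accuracy on 5 held-out folds
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

The scores on synthetic data say little about how the models rank on a real problem; the point is only that the surrounding code does not change when the estimator does.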

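Let's try each of the above algorithms on a sample dataset. The snippets below assume a file named dataset.csv with two feature columns, feature1 and feature2, and a target column, label.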

Linear regression

import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the dataset
data = pd.read_csv("dataset.csv")

# Split the dataset into features and labels
X = data[["feature1", "feature2"]]
y = data["label"]

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X, y)

# Predict target values (X is the training data here; pass unseen rows
# with the same columns to predict new data points)
predictions = model.predict(X)
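
The snippet above predicts on the same rows the model was fitted on, which says little about how it behaves on unseen data. As a rough sketch of one common alternative, the example below holds out a quarter of the rows with train_test_split and reports R² on them. The synthetic data stands in for dataset.csv and is an assumption, as are the true coefficients 3, -2 and 1 used to generate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset.csv: y = 3*x1 - 2*x2 + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

# Keep 25% of the rows aside for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) for regressors
r2 = model.score(X_test, y_test)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on held-out data:", r2)
```

Because the data was generated from a known linear rule, the fitted coefficients should land close to the values used to create it.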

Logistic regression

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the dataset
data = pd.read_csv("dataset.csv")

# Split the dataset into features and labels
X = data[["feature1", "feature2"]]
y = data["label"]

# Create a logistic regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X, y)

# Predict class labels (X is the training data here; pass unseen rows
# with the same columns to classify new data points)
predictions = model.predict(X)
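
Logistic regression can also report how confident it is in each prediction. The sketch below, which assumes synthetic two-class data in place of dataset.csv, uses predict_proba to get one probability per class and checks that predict returns the class with the higher probability.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with two features (an assumption for illustration)
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

model = LogisticRegression()
model.fit(X, y)

# One row per sample, one column per class; each row sums to 1
proba = model.predict_proba(X[:5])
labels = model.predict(X[:5])

for row, label in zip(proba, labels):
    print(f"P(class 0)={row[0]:.3f}  P(class 1)={row[1]:.3f}  predicted={label}")
```

The columns of predict_proba follow the order of model.classes_, which matters when the labels are not simply 0 and 1.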

K-nearest neighbours

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
data = pd.read_csv("dataset.csv")

# Split the dataset into features and labels
X = data[["feature1", "feature2"]]
y = data["label"]

# Create a k-nearest neighbors model
model = KNeighborsClassifier(n_neighbors=5)

# Fit the model to the training data
model.fit(X, y)

# Predict class labels (X is the training data here; pass unseen rows
# with the same columns to classify new data points)
predictions = model.predict(X)
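
K-nearest neighbours relies on distances between points, so a feature measured on a much larger scale can dominate the result. One common way to handle this, sketched below under the assumption of synthetic data with one deliberately inflated feature, is to standardize the features in a pipeline before the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (an assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
# Put the second feature on a much larger scale than the first
X[:, 1] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler is fitted on the training rows only, then applied to the test rows
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("accuracy on held-out data:", accuracy)
```

Wrapping both steps in a pipeline keeps the scaling statistics from leaking out of the held-out rows into training.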

Support vector machines

import pandas as pd
from sklearn.svm import SVC

# Load the dataset
data = pd.read_csv("dataset.csv")

# Split the dataset into features and labels
X = data[["feature1", "feature2"]]
y = data["label"]

# Create a support vector classifier (the default kernel is RBF)
model = SVC()

# Fit the model to the training data
model.fit(X, y)

# Predict class labels (X is the training data here; pass unseen rows
# with the same columns to classify new data points)
predictions = model.predict(X)
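
SVC is not limited to two classes. As a small illustration, the sketch below fits it on the three-class iris dataset that ships with scikit-learn (chosen here for convenience, not taken from the article) and inspects the per-class decision scores.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Iris has three classes, encoded as 0, 1 and 2
X, y = load_iris(return_X_y=True)

model = SVC()
model.fit(X, y)

# The classes the model learned to separate
print("classes:", model.classes_)

# With the default settings, decision_function returns one score per class
scores = model.decision_function(X[:3])
predictions = model.predict(X[:3])
print("score matrix shape:", scores.shape)
print("predictions:", predictions)
```

The same fit and predict calls are used as in the two-class case; SVC handles the multi-class bookkeeping internally.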

Random forests

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
data = pd.read_csv("dataset.csv")

# Split the dataset into features and labels
X = data[["feature1", "feature2"]]
y = data["label"]

# Create a random forest classifier with 100 decision trees
model = RandomForestClassifier(n_estimators=100)

# Fit the model to the training data
model.fit(X, y)

# Predict class labels (X is the training data here; pass unseen rows
# with the same columns to classify new data points)
predictions = model.predict(X)
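
A fitted random forest also exposes feature_importances_, a rough indication of how much each feature contributed to the trees' splits. The sketch below assumes synthetic data in which the first two columns carry signal and the last two are noise; the feature names are hypothetical labels added for readability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative columns come first, followed by noise columns
X, y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = ["informative_1", "informative_2", "noise_1", "noise_2"]  # hypothetical

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One non-negative value per feature; the values sum to 1
importances = model.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance:.3f}")
```

These importances are computed from the training data, so they are best read as a hint about the model rather than a statement about the underlying problem.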

You can use any of these algorithms to solve various kinds of real-world problems. For instance, we can use a support vector machine to classify drugs based on a set of patient attributes. I hope you find the article useful. In my next blog post, I will apply some of these algorithms to drug classification datasets from Kaggle.
