PyCaret is an open-source machine learning library in Python that is designed to reduce the time and effort required for building, deploying and scaling machine learning models. It provides a simplified and unified interface for various machine learning tasks like classification, regression, clustering, natural language processing and anomaly detection. PyCaret is built on top of popular machine learning libraries like scikit-learn, XGBoost, LightGBM, spaCy and others.
PyCaret aims to democratize machine learning and make it accessible to everyone, from novice to expert. It achieves this by providing an end-to-end machine learning workflow that includes data preparation, feature engineering, model training and evaluation, hyperparameter tuning and model deployment, all with just a few lines of code.
In this article, we will explore some of the benefits of using PyCaret and demonstrate how to use it for a simple classification problem.
Benefits of PyCaret:
- Simplifies the machine learning workflow: PyCaret provides a simple and intuitive API that abstracts away the complexities of machine learning. You can perform common machine learning tasks like data cleaning, feature engineering and model training with just a few lines of code.
- Speeds up model development: PyCaret provides a streamlined workflow that allows you to quickly iterate through different machine learning models and hyperparameters. You can compare the performance of multiple models and select the best one for your problem with just a few lines of code.
- Automates tedious tasks: PyCaret automates many of the tedious and time-consuming tasks involved in machine learning, such as data preprocessing, feature engineering and hyperparameter tuning. This allows you to focus on more important aspects of your project, such as understanding your data and interpreting your results.
- Provides a unified interface: PyCaret provides a unified interface for many different machine learning tasks, such as classification, regression and clustering. This means you can use the same API for different machine learning tasks, which simplifies the learning curve and reduces the time and effort required to learn new tools.
- Provides reproducibility: PyCaret provides a way to save and load trained models, which ensures reproducibility of results. This is especially important in production environments where you need to ensure that your models behave consistently over time (see the short sketch after this list).
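As a quick sketch of the reproducibility point above (assuming you already have a trained classification model in a variable called model and write access to the working directory), saving and reloading a PyCaret pipeline looks like this:

from pycaret.classification import save_model, load_model

# Persist the full pipeline (preprocessing steps plus model) to disk
save_model(model, 'my_pipeline')

# Later, or in a different process, restore the exact same pipeline
loaded_pipeline = load_model('my_pipeline')

The same save_model and load_model functions exist in PyCaret's other modules, so the pattern carries over to regression, clustering and anomaly detection.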
Install PyCaret
First, make sure PyCaret is installed in your Python environment. You can install it using pip:
pip install pycaret
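The base install is enough to follow along with this article. If you also want the optional dependencies (for example the extra tuning and interpretability libraries), PyCaret documents a fuller install:

pip install pycaret[full]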
Classification Model
We need to load the data that we want to use for our classification model. For this tutorial, we’ll use the famous Titanic dataset that is often used for machine learning exercises.
The Titanic dataset is a classic example used in the field of machine learning and data science. It is a dataset that contains information about the passengers who were onboard the Titanic ship during its maiden voyage, which famously sank after hitting an iceberg on April 15, 1912.
The dataset contains a total of 891 observations and 12 variables, including:
- PassengerId: Unique identifier for each passenger
- Survived: Whether the passenger survived or not (0 = No, 1 = Yes)
- Pclass: The class of the passenger’s ticket (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: The name of the passenger
- Sex: The gender of the passenger
- Age: The age of the passenger in years
- SibSp: The number of siblings/spouses the passenger had aboard the Titanic
- Parch: The number of parents/children the passenger had aboard the Titanic
- Ticket: The ticket number of the passenger
- Fare: The passenger’s fare
- Cabin: The cabin number of the passenger
- Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
The goal of machine learning models built using the Titanic dataset is to predict whether a given passenger survived or not, based on the other variables in the dataset. The dataset is often used as an introductory example for classification tasks, as it is relatively small and easy to understand, yet still presents some interesting challenges.
Step 1: Load dataset
import pandas as pd

# Load the Titanic dataset from a CSV file
titanic = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
Step 2: Initialize PyCaret
Now we need to initialize PyCaret's classification module. We'll use the setup() function to set up our data for modeling.
from pycaret.classification import *

# Set up the data for modeling
clf = setup(data=titanic, target='Survived')
We pass the titanic dataframe and the Survived column as the target variable to setup(). This initializes PyCaret and sets up the environment for building a classification model.
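setup() also accepts many optional arguments. The call below is only an illustrative sketch, not something this example requires: session_id fixes the random seed so results are reproducible, and ignore_features drops identifier-like columns that should not be used as predictors.

# Optional: a more explicit setup call
clf = setup(
    data=titanic,
    target='Survived',
    session_id=123,
    ignore_features=['PassengerId', 'Name', 'Ticket', 'Cabin']
)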
Step 3: Compare models
Next, we can use the compare_models() function to compare the performance of several different classification models:
best_model = compare_models()
This will compare the performance of several different classification algorithms and return the best one based on a default evaluation metric (accuracy, in the case of classification). The best_model variable will contain the best performing model.
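compare_models() can also be steered. As a small sketch (the metric name must match one of the columns in the comparison table), you can sort by a different metric and keep the top few models instead of a single winner:

# Sort the comparison by AUC and keep the three best models
top3_models = compare_models(sort='AUC', n_select=3)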
Step 4: Create a model
Now that we have selected the best model, we can retrain it using the create_model() function. This function trains the selected algorithm with cross-validation and returns the trained model object.
# Retrain the best performing model with cross-validation
tuned_model = create_model(best_model)
Step 5: Evaluate the model
Now we can evaluate the performance of our model using the evaluate_model() function. This will generate a report that contains several common classification evaluation metrics such as accuracy, precision, recall, and F1 score.
# Evaluate the performance of the model
evaluate_model(tuned_model)
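evaluate_model() renders an interactive widget, which is great in a notebook but less useful in a script. If you prefer individual plots, plot_model() produces them one at a time; the plot names below are standard PyCaret classification plots:

# Individual evaluation plots for the trained classifier
plot_model(tuned_model, plot='confusion_matrix')
plot_model(tuned_model, plot='auc')
plot_model(tuned_model, plot='feature')  # feature importance (for models that expose it)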
Step 6: Make predictions
Finally, we can use our model to make predictions on new data using the predict_model() function. This function will take in a new dataset and return the predicted class label for each observation.
# Make predictions on new data
new_data = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'male'],
    'Age': [22, 38, 26],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [7.25, 71.28, 7.92]
})

predictions = predict_model(tuned_model, data=new_data)
This will return a dataframe containing the predicted class label for each observation.
PyCaret makes it easy to quickly iterate through different machine learning models and evaluate their performance, so you can quickly find the best model for your data.
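Before deploying a model you are happy with, a typical last step is to finalize it, which retrains the pipeline on the full dataset including the hold-out split, and then save it to disk. A short sketch, assuming tuned_model from the steps above:

# Retrain on the complete dataset and persist the pipeline
final_model = finalize_model(tuned_model)
save_model(final_model, 'titanic_pipeline')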
Regression Model
For this tutorial, we’ll use the Boston Housing dataset that is often used for machine learning exercises.
The Boston Housing dataset is a classic example used in the field of machine learning and data science. It is a dataset that contains information about various factors that influence the median value of owner-occupied homes in the suburbs of Boston. This dataset is often used to build regression models to predict the median value of owner-occupied homes based on different factors.
The dataset contains a total of 506 observations and 14 variables, including:
- CRIM: per capita crime rate by town.
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- NOX: nitric oxides concentration (parts per 10 million).
- RM: average number of rooms per dwelling.
- AGE: proportion of owner-occupied units built prior to 1940.
- DIS: weighted distances to five Boston employment centers.
- RAD: index of accessibility to radial highways.
- TAX: full-value property-tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- B: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.
- LSTAT: lower status of the population (percent).
- MEDV: median value of owner-occupied homes in $1000s.
The goal of machine learning models built using the Boston Housing dataset is to predict the median value of owner-occupied homes based on the other variables in the dataset. This dataset is often used as an introductory example for regression tasks, as it is relatively small and easy to understand, yet still presents some interesting challenges.
Step 1: Load dataset
import pandas as pd

# Load the Boston Housing dataset from a CSV file
boston = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
Step 2: Initialize PyCaret
Now we need to initialize PyCaret's regression module. We'll use the setup() function to set up our data for modeling.
from pycaret.regression import *

# Set up the data for modeling
reg = setup(data=boston, target='medv')
We pass the boston dataframe and the medv column as the target variable to setup(). This initializes PyCaret and sets up the environment for building a regression model.
Step 3: Compare models
Next, we can use the compare_models() function to compare the performance of several different regression models:
best_model = compare_models()
This will compare the performance of several different regression algorithms and return the best one based on a default evaluation metric (R2, in the case of regression). The best_model variable will contain the best performing model.
Step 4: Create a model
Now that we have selected the best model, we can retrain it using the create_model() function. This function trains the selected algorithm with cross-validation and returns the trained model object.
# Retrain the best performing model with cross-validation
tuned_model = create_model(best_model)
Step 5: Evaluate the model
Now we can evaluate the performance of our model using the evaluate_model() function. This will generate a report that contains several common regression evaluation metrics such as RMSE, MAE, R2, and MAPE.
# Evaluate the performance of the model
evaluate_model(tuned_model)
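As in the classification example, plot_model() gives individual, scriptable diagnostics; the plot names below are standard PyCaret regression plots:

# Residuals and prediction-error plots for the trained regressor
plot_model(tuned_model, plot='residuals')
plot_model(tuned_model, plot='error')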
Step 6: Make predictions
Finally, we can use our model to make predictions on new data using the predict_model() function. This function will take in a new dataset and return the predicted target value for each observation.
# Make predictions on new data
new_data = pd.DataFrame({
    'crim': [0.06, 0.07, 0.01],
    'zn': [20, 30, 40],
    'indus': [6.9, 7.7, 2.9],
    'chas': [0, 1, 0],
    'nox': [0.4, 0.5, 0.6],
    'rm': [6, 7, 8],
    'age': [55, 60, 70],
    'dis': [4, 5, 6],
    'rad': [4, 5, 6],
    'tax': [330, 350, 370],
    'ptratio': [18, 19, 20],
    'black': [380, 385, 390],
    'lstat': [5, 6, 7]
})

predictions = predict_model(tuned_model, data=new_data)
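A useful sanity check before scoring genuinely new data: if you call predict_model() without passing any data, PyCaret scores the hold-out set that setup() created, so you can see how the model performs on data it was not trained on.

# Score the hold-out split created by setup()
holdout_predictions = predict_model(tuned_model)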
Association Analysis
PyCaret is a powerful open-source machine learning library that offers a wide range of tools for various tasks, including association analysis. Association analysis is a technique used to identify relationships between variables in a dataset. It is often used in market basket analysis to identify patterns of co-occurrence between items in a transactional dataset. In this tutorial, we will use PyCaret to perform association analysis on a transactional dataset.
First, let’s load our dataset. We will be using the “Online Retail II” dataset, which contains transactional data from an online retailer based in the UK. You can download the dataset from the UCI Machine Learning Repository.
The Online Retail II dataset is a transactional dataset containing customer purchases from an online retailer based in the UK between 2009 and 2011. The dataset is publicly available and can be downloaded from the UCI Machine Learning Repository.
The dataset contains approximately 1 million transactions, with each transaction representing a purchase made by a customer. The data includes the following fields:
- InvoiceNo: A unique identifier for each transaction.
- StockCode: A unique identifier for each product.
- Description: A description of the product.
- Quantity: The quantity of the product purchased in the transaction.
- InvoiceDate: The date and time of the transaction.
- UnitPrice: The unit price of the product in pounds sterling.
- CustomerID: A unique identifier for each customer.
- Country: The country where the transaction was made.
The dataset is a useful resource for various tasks, including market basket analysis, customer segmentation, and sales forecasting. The data is also interesting because it contains a range of different product categories, including gifts, household items, and clothing.
It is worth noting that the dataset contains some missing values, which will need to be handled appropriately if the data is to be used for analysis or modeling. Additionally, the dataset includes transactions from customers located in various countries, but for some tasks, it may be necessary to filter the data to only include transactions from a specific country or region.
Step 1: Load dataset
import pandas as pd

# Load the dataset
data = pd.read_excel('Online Retail.xlsx')
Step 2: Preprocess data
Next, we need to preprocess our data by cleaning it and getting it into the right shape for association analysis. We will drop rows with missing values, remove returns (non-positive quantities), and restrict the data to transactions from the United Kingdom. For exploration, we will also pivot the data into a basket matrix in which each row represents a single transaction and each column represents an item.
# Clean the data
data = data.dropna()
data = data[data['Quantity'] > 0]
data = data[data['Country'] == 'United Kingdom']

# Transform the data into a basket matrix (one row per invoice, one column per item)
basket = (data.groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
In the code above, we first drop rows with missing values and remove transactions with a quantity of 0 or less. We then filter our data to only include transactions from the United Kingdom. Finally, we group the data by InvoiceNo and Description, sum the quantity of each item in each transaction, unstack so that each item becomes a column, fill missing values with 0, and set InvoiceNo as the index. The basket matrix is handy for exploring co-occurrence, as the short sketch below shows; PyCaret’s association rules module itself works directly on the cleaned transaction-level dataframe.
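If you want to look at the basket in the classic 0/1 market-basket form, you can binarize the quantities. This is plain pandas and purely for exploration; PyCaret does not require it:

# Convert quantities to a 0/1 indicator of whether an item appears in an invoice
basket_sets = (basket > 0).astype(int)
print(basket_sets.head())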
Step 3: Perform Association Analysis
Now that we have preprocessed our data, we can use PyCaret’s association rule mining module to perform association analysis.
from pycaret.arules import *

# Initialize the association rules module on the transaction-level data
s = setup(data=data, transaction_id='InvoiceNo', item_id='Description')

# Create association rules
rules = create_model()
In the code above, we first import PyCaret’s association rule mining module. We then initialize the module using the setup function, pointing it at the cleaned transactional data and naming the transaction ID and item ID columns. Finally, we mine association rules using the create_model function, which returns the rules as a pandas DataFrame.
Step 4: Model Inspection
We can now inspect the generated rules and use them for further analysis. Because create_model returns the rules as a DataFrame, ordinary pandas operations are enough; for example, we can sort by lift to surface the strongest associations.
# View the strongest rules, sorted by lift
print(rules.sort_values(by='lift', ascending=False).head(10))
Step 5: Query the rules
We can also query the rules to find out which items are frequently bought together with a specific product.
# Find rules whose antecedents contain a specific item
item = 'JUMBO BAG PINK POLKADOT'
item_rules = rules[rules['antecedents'].astype(str).str.contains(item)]
print(item_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
In the code above, we look up every rule whose antecedents contain the item “JUMBO BAG PINK POLKADOT”, which tells us which other items customers tend to purchase together with it.
Anomaly Detection
Anomaly detection is a technique used to identify rare or unusual observations in a dataset that may be of interest. In this tutorial, we will use PyCaret to perform anomaly detection on a dataset.
First, let’s load our dataset. We will be using the “Credit Card Fraud Detection” dataset, which contains credit card transactions that have been labeled as either fraudulent or non-fraudulent. You can download the dataset from Kaggle.
The Credit Card Fraud Detection dataset is a publicly available dataset that contains transactions made with credit cards in September 2013 by European cardholders. The dataset was collected and made available by Worldline and the Machine Learning Group of Université Libre de Bruxelles.
The dataset contains a total of 284,807 transactions, of which 492 (0.17%) are fraudulent. Each transaction contains the following fields:
- Time: The number of seconds elapsed between the first transaction in the dataset and the current transaction.
- V1 – V28: These are anonymized features obtained from a PCA transformation of the original features. These features represent numeric input variables that describe each transaction.
- Amount: The transaction amount.
- Class: A binary label indicating whether the transaction is fraudulent (1) or not (0).
The goal of this dataset is to build a machine learning model that can accurately detect fraudulent transactions based on the input features. The imbalanced nature of the dataset, with only a small fraction of transactions being fraudulent, makes this a challenging task.
This dataset is commonly used for machine learning research and is often used as a benchmark for evaluating the performance of different anomaly detection and fraud detection algorithms. It provides a useful resource for developing and testing new approaches to fraud detection, which is an important problem in many industries, including finance, e-commerce, and healthcare.
Step 1: Load dataset
import pandas as pd

# Load the dataset
data = pd.read_csv('creditcard.csv')
Step 2: Preprocess data
Next, we need to preprocess our data by cleaning and transforming it into the right format for anomaly detection. We will filter out rows with missing values, normalize our data, and split it into training and testing sets.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Clean the data
data = data.dropna()

# Normalize the data
scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data[['Amount']])

# Split the data into training and testing sets
train, test = train_test_split(data, test_size=0.2, random_state=42)
In the code above, we first drop rows with missing values. We then normalize the Amount column using the StandardScaler function from scikit-learn. Finally, we split our data into training and testing sets using the train_test_split function.
Step 3: Train model
Now that we have preprocessed our data, we can use PyCaret’s anomaly detection module to perform anomaly detection.
from pycaret.anomaly import *

# Initialize the anomaly detection module
s = setup(train, session_id=42)

# Create an anomaly detection model (an Isolation Forest)
model = create_model('iforest')
In the code above, we first import PyCaret’s anomaly detection module. We then initialize the module using the setup function and specify a session ID for reproducibility. Finally, we create an anomaly detection model, here an Isolation Forest, using the create_model function.
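create_model() in the anomaly module takes the ID of the detector to train, so other detectors can be tried in the same way, for example 'knn' or 'lof'. The sketch below also uses assign_model(), which appends the anomaly labels and scores to the training data:

# Train an alternative detector and label the training data
knn_detector = create_model('knn')
labeled_train = assign_model(knn_detector)
print(labeled_train.head())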
Step 4: Perform Predictions
We can now use the generated model to make predictions and perform further analysis. For example, we can use the predict_model function to make predictions on our testing set and calculate performance metrics.
# Make predictions on the testing set
predictions = predict_model(model, data=test)

# Calculate performance metrics
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(predictions['Class'], predictions['Anomaly']))
print(classification_report(predictions['Class'], predictions['Anomaly']))
In the code above, we first use the predict_model function to make predictions on our testing set. We then calculate performance metrics using the confusion_matrix and classification_report functions from scikit-learn.
Clustering
Clustering is a technique used in unsupervised learning to group similar objects together based on their features. PyCaret’s clustering module provides a variety of clustering algorithms that can be used for this purpose. In this tutorial, we will use the “Iris” dataset to demonstrate how to perform clustering using PyCaret.
The Iris dataset is a classic dataset in machine learning and statistics, and it is often used as a benchmark dataset for classification problems. The dataset contains information about different species of iris flowers, and the goal is to predict the species of a flower based on its measurements.
The dataset contains a total of 150 samples, with 50 samples for each of the three iris species: setosa, versicolor, and virginica. Each sample contains the following four features:
- Sepal Length: The length of the sepal (in cm).
- Sepal Width: The width of the sepal (in cm).
- Petal Length: The length of the petal (in cm).
- Petal Width: The width of the petal (in cm).
The dataset is often used for exploring and visualizing data, as well as for testing and comparing different classification algorithms. Because the dataset is relatively small and easy to work with, it is a popular choice for beginners learning about machine learning.
The Iris dataset is available in many machine learning libraries, including scikit-learn and PyCaret, making it easy to access and use for a wide range of applications.
Step 1: Load dataset
First, let’s load the dataset:
from pycaret.datasets import get_data

data = get_data('iris')
In the code above, we use PyCaret’s get_data function to load the “Iris” dataset.
Step 2: Setup clustering model
Next, let’s set up our clustering experiment using PyCaret’s setup function:
from pycaret.clustering import setup

clustering = setup(data, normalize=True, session_id=123)
In the code above, we use the setup function to preprocess our data and set up our clustering experiment. We specify normalize=True to normalize our data, and session_id=123 to ensure reproducibility.
Step 3: Create a model
Now that we have set up our experiment, we can train a clustering model using PyCaret’s create_model function; here we use k-means:
from pycaret.clustering import create_model

# Train a k-means clustering model
kmeans = create_model('kmeans')
In the code above, we use the create_model function to train a k-means model on the preprocessed data. Other algorithms, such as 'hclust' for hierarchical clustering or 'dbscan', can be created in the same way by passing their model ID.
Step 4: Predict cluster labels
We can now use the trained model to assign cluster labels to our data:
from pycaret.clustering import predict_model

predictions = predict_model(kmeans, data=data)
In the code above, we use the predict_model function to assign a cluster label to each observation using the k-means model trained above.
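If you only need cluster labels for the data that was passed to setup(), assign_model() is a convenient shortcut: it returns the original dataframe with the cluster assignment appended. A short sketch using the k-means model created above:

from pycaret.clustering import assign_model

# Append cluster assignments to the data used in setup()
clustered_data = assign_model(kmeans)
print(clustered_data.head())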
Step 5: Visualize the model
We can also visualize the results of our clustering using PyCaret’s plot_model function:
from pycaret.clustering import plot_model

plot_model(kmeans, plot='elbow')
In the code above, we use the plot_model function to draw the elbow plot for our clustering algorithm, which helps us determine the optimal number of clusters.
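Besides the elbow plot, plot_model() supports several other clustering visualizations; the plot names below are standard options in PyCaret's clustering module:

# Additional clustering visualizations
plot_model(kmeans, plot='silhouette')
plot_model(kmeans, plot='tsne')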
Natural Language Processing (NLP)
PyCaret’s NLP module provides a wide range of tools for preprocessing and analyzing text data, including feature extraction, dimensionality reduction, and topic modeling. In this tutorial, we will use the “kiva” dataset to demonstrate how to perform NLP using PyCaret.
The Kiva dataset is a collection of loan descriptions and other data from the Kiva microfinance platform, which provides loans to individuals and small businesses in developing countries. The dataset contains information about the borrowers, the lenders, the loan amounts, and the loan descriptions, which are written by the borrowers themselves.
The goal of the dataset is to enable analysis and visualization of the loan descriptions, as well as to explore the relationship between the borrowers and the lenders. The dataset contains a total of 681,207 loan descriptions, which are written in a variety of different languages, including English, Spanish, French, and Arabic.
Each loan description in the dataset contains the following features:
- Loan ID: A unique identifier for the loan.
- Loan Description: A text description of the loan, written by the borrower.
- Loan Amount: The amount of the loan, in US dollars.
- Country: The country where the loan is being made.
- Sector: The sector of the economy where the borrower works.
- Activity: The specific activity that the borrower is engaged in.
- Currency: The currency of the loan.
- Posted Date: The date when the loan was posted on the Kiva platform.
- Funded Date: The date when the loan was fully funded by lenders.
- Term in Months: The length of the loan term, in months.
- Repayment Interval: The frequency of loan repayments (e.g., weekly, monthly).
- Lender Count: The number of lenders who contributed to the loan.
- Loan Status: The current status of the loan (e.g., funded, expired, defaulted).
The Kiva dataset is often used for exploring and visualizing text data, as well as for testing and comparing different natural language processing (NLP) algorithms. Because the dataset is relatively large and diverse, it is a popular choice for researchers and data scientists working in the fields of microfinance and NLP.
Step 1: Load dataset
First, let’s load the dataset:
from pycaret.datasets import get_data

# load dataset
data = get_data('kiva')
In the code above, we use PyCaret’s get_data function to load the “Kiva” dataset, which contains loan descriptions from the Kiva microfinance platform.
Step 2: Setup experiment
Next, let’s set up our NLP experiment using PyCaret’s setup function:
from pycaret.nlp import setup

nlp = setup(data, target='en', session_id=123)
In the code above, we use the setup function to preprocess our text data and set up our NLP experiment. We specify target='en' to point PyCaret at the column containing the English loan descriptions, and session_id=123 to ensure reproducibility.
Step 3: Create model
Now that we have set up our experiment, we can extract topics from our text data using PyCaret’s create_model function:
from pycaret.nlp import create_model

lda = create_model('lda')
In the code above, we use the create_model function to train a Latent Dirichlet Allocation (LDA) model, a popular topic modeling algorithm, on our text data.
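create_model() in the NLP module also lets you choose how many topics to extract via the num_topics argument. The call below is a sketch with an arbitrary choice of six topics:

# Train an LDA model with a chosen number of topics
lda_six_topics = create_model('lda', num_topics=6)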
Step 4: Assign topics
We can now attach the extracted topics to the documents in our dataset using PyCaret’s assign_model function:
from pycaret.nlp import assign_model

# Assign topic proportions and the dominant topic to each document
lda_results = assign_model(lda)
print(lda_results.head())
In the code above, we use the assign_model function to append the topic weights and the dominant topic for each loan description to our original dataset.
Step 5: Visualize results
We can also visualize the topics extracted from our text data using PyCaret’s plot_model function:
from pycaret.nlp import plot_model

plot_model(lda, plot='topic_model')
In the code above, we use the plot_model function to plot the topics extracted from our text data using the LDA model.
PyCaret’s NLP module provides a powerful and flexible set of tools for analyzing text data, and it makes it easy to extract features and make predictions using a variety of different algorithms.
Model Explainability
PyCaret provides a range of tools for model explainability that can help data scientists and machine learning practitioners understand how their models make predictions and which features are most important for those predictions. Here’s an example of using PyCaret for model explainability on the diabetes dataset.
The diabetes dataset bundled with PyCaret is the well-known Pima Indians Diabetes dataset, a frequently used benchmark in machine learning and data science. It contains medical data from 768 female patients of Pima Indian heritage, including features such as the number of pregnancies, plasma glucose concentration, blood pressure, skin fold thickness, serum insulin, body mass index (BMI), diabetes pedigree function, and age. The target variable, Class variable, indicates whether the patient developed diabetes.
The data was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases, and it has become a popular choice for machine learning practitioners due to its size, completeness, and relevance to real-world problems.
The dataset is often used as a benchmark for binary classification models and is frequently used as a teaching dataset in machine learning courses and workshops, due to its accessibility and well-documented nature.
Overall, the diabetes dataset is a valuable resource for data scientists and machine learning practitioners looking to develop and test classification models.
# Load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# Create a model using PyCaret
from pycaret.classification import *
clf = setup(data, target='Class variable')
model = create_model('rf')  # a tree-based model, which SHAP's TreeExplainer supports

# Explain model predictions using SHAP values via interpret_model()
interpret_model(model, plot='summary')      # overall feature impact
interpret_model(model, plot='correlation')  # SHAP dependence plot
In this example, we first load the diabetes dataset using the get_data() function from PyCaret’s built-in datasets. We then create a random forest model using the setup() and create_model() functions from PyCaret’s classification module.
To explain the model predictions, we use PyCaret’s interpret_model() function, which calculates SHAP (SHapley Additive exPlanations) values; these provide a way to understand the importance of each feature for each individual prediction. Note that SHAP-based interpretation in PyCaret is designed for tree-based models, which is why we created a random forest rather than a linear model.
In this example, we call interpret_model() with plot='summary' to create a summary plot of the SHAP values, which shows the overall impact of each feature on the model predictions. We also call it with plot='correlation' to create a SHAP dependence plot, which shows how the model predictions change as the value of a given feature changes.
By using PyCaret’s built-in tools for model explainability, we can gain insights into how our model is making predictions and which features are most important for those predictions. This can help us to identify potential issues with our model and to improve its accuracy and performance.
Conclusion
In conclusion, PyCaret is a powerful and user-friendly machine learning library that simplifies the process of building and deploying machine learning models. It provides a range of tools for data preprocessing, feature engineering, model selection, and model evaluation, as well as a variety of built-in algorithms and models. With PyCaret, data scientists and machine learning practitioners can quickly build and deploy accurate and robust models without having to write complex code or spend a lot of time on model development. PyCaret is a valuable addition to any data scientist’s toolkit and has quickly become a popular choice among machine learning practitioners due to its simplicity, flexibility, and ease of use.