PyCaret is an open-source machine learning library in Python that is designed to reduce the time and effort required for building, deploying and scaling machine learning models. It provides a simplified and unified interface for various machine learning tasks like classification, regression, clustering, natural language processing and anomaly detection. PyCaret is built on top of popular machine learning libraries like scikit-learn, XGBoost, LightGBM, spaCy and others.
PyCaret aims to democratize machine learning and make it accessible to everyone, from novice to expert. It achieves this by providing an end-to-end machine learning workflow that includes data preparation, feature engineering, model training and evaluation, hyperparameter tuning and model deployment, all with just a few lines of code.
In this article, we will explore some of the benefits of using PyCaret and demonstrate how to use it for a simple classification problem.
Benefits of PyCaret:
- Simplifies the machine learning workflow: PyCaret provides a simple and intuitive API that abstracts away the complexities of machine learning. You can perform common machine learning tasks like data cleaning, feature engineering and model training with just a few lines of code.
- Speeds up model development: PyCaret provides a streamlined workflow that allows you to quickly iterate through different machine learning models and hyperparameters. You can compare the performance of multiple models and select the best one for your problem with just a few lines of code.
- Automates tedious tasks: PyCaret automates many of the tedious and time-consuming tasks involved in machine learning, such as data preprocessing, feature engineering and hyperparameter tuning. This allows you to focus on more important aspects of your project, such as understanding your data and interpreting your results.
- Provides a unified interface: PyCaret provides a unified interface for many different machine learning tasks, such as classification, regression and clustering. This means you can use the same API for different machine learning tasks, which simplifies the learning curve and reduces the time and effort required to learn new tools.
- Provides reproducibility: PyCaret provides a way to save and load trained models, which ensures reproducibility of results. This is especially important in production environments where you need to ensure that your models behave consistently over time (see the short sketch after this list).
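As a quick sketch of the reproducibility point above (assuming you already have a trained classification model in a variable called model and write access to the working directory), saving and reloading a PyCaret pipeline looks like this:

from pycaret.classification import save_model, load_model

# Persist the full pipeline (preprocessing steps plus model) to disk
save_model(model, 'my_pipeline')

# Later, or in a different process, restore the exact same pipeline
loaded_pipeline = load_model('my_pipeline')

The same save_model and load_model functions exist in PyCaret's other modules, so the pattern carries over to regression, clustering and anomaly detection.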
Install PyCaret
First, make sure PyCaret is installed in your Python environment. You can install it using pip:
pip install pycaret
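The base install is enough to follow along with this article. If you also want the optional dependencies (for example the extra tuning and interpretability libraries), PyCaret documents a fuller install:

pip install pycaret[full]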
Classification Model
We need to load the data that we want to use for our classification model. For this tutorial, we’ll use the famous Titanic dataset that is often used for machine learning exercises.
The Titanic dataset is a classic example used in the field of machine learning and data science. It is a dataset that contains information about the passengers who were onboard the Titanic ship during its maiden voyage, which famously sank after hitting an iceberg on April 15, 1912.
The dataset contains a total of 891 observations and 12 variables, including:
- PassengerId: Unique identifier for each passenger
- Survived: Whether the passenger survived or not (0 = No, 1 = Yes)
- Pclass: The class of the passenger’s ticket (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: The name of the passenger
- Sex: The gender of the passenger
- Age: The age of the passenger in years
- SibSp: The number of siblings/spouses the passenger had aboard the Titanic
- Parch: The number of parents/children the passenger had aboard the Titanic
- Ticket: The ticket number of the passenger
- Fare: The passenger’s fare
- Cabin: The cabin number of the passenger
- Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
The goal of machine learning models built using the Titanic dataset is to predict whether a given passenger survived or not, based on the other variables in the dataset. The dataset is often used as an introductory example for classification tasks, as it is relatively small and easy to understand, yet still presents some interesting challenges.
Step 1: Load dataset
import pandas as pd

# Load the Titanic dataset from a CSV file
titanic = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
Step 2: Initialize PyCaret
Now we need to initialize PyCaret's classification module. We'll use the setup() function to set up our data for modeling.
from pycaret.classification import *

# Set up the data for modeling
clf = setup(data=titanic, target='Survived')
We pass the titanic dataframe and the Survived column as the target variable to setup(). This initializes PyCaret and sets up the environment for building a classification model.
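setup() also accepts many optional arguments. The call below is only an illustrative sketch, not something this example requires: session_id fixes the random seed so results are reproducible, and ignore_features drops identifier-like columns that should not be used as predictors.

# Optional: a more explicit setup call
clf = setup(
    data=titanic,
    target='Survived',
    session_id=123,
    ignore_features=['PassengerId', 'Name', 'Ticket', 'Cabin']
)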
Step 3: Compare models
Next, we can use the compare_models() function to compare the performance of several different classification models:
best_model = compare_models()
This will compare the performance of several different classification algorithms and return the best one based on a default evaluation metric (accuracy, in the case of classification). The best_model variable will contain the best performing model.
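compare_models() can also be steered. As a small sketch (the metric name must match one of the columns in the comparison table), you can sort by a different metric and keep the top few models instead of a single winner:

# Sort the comparison by AUC and keep the three best models
top3_models = compare_models(sort='AUC', n_select=3)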
Step 4: Create a model
Now that we have selected the best model, we can retrain it using the create_model() function. This function trains the selected algorithm with cross-validation and returns the trained model object.
# Retrain the best performing model with cross-validation
tuned_model = create_model(best_model)
Step 5: Evaluate the model
Now we can evaluate the performance of our model using the evaluate_model() function. This will generate a report that contains several common classification evaluation metrics such as accuracy, precision, recall, and F1 score.
# Evaluate the performance of the model
evaluate_model(tuned_model)
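evaluate_model() renders an interactive widget, which is great in a notebook but less useful in a script. If you prefer individual plots, plot_model() produces them one at a time; the plot names below are standard PyCaret classification plots:

# Individual evaluation plots for the trained classifier
plot_model(tuned_model, plot='confusion_matrix')
plot_model(tuned_model, plot='auc')
plot_model(tuned_model, plot='feature')  # feature importance (for models that expose it)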
Step 6: Make predictions
Finally, we can use our model to make predictions on new data using the predict_model() function. This function will take in a new dataset and return the predicted class label for each observation.
# Make predictions on new data
new_data = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'male'],
    'Age': [22, 38, 26],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [7.25, 71.28, 7.92]
})

predictions = predict_model(tuned_model, data=new_data)
This will return a dataframe containing the predicted class label for each observation.
PyCaret makes it easy to quickly iterate through different machine learning models and evaluate their performance, so you can quickly find the best model for your data.
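Before deploying a model you are happy with, a typical last step is to finalize it, which retrains the pipeline on the full dataset including the hold-out split, and then save it to disk. A short sketch, assuming tuned_model from the steps above:

# Retrain on the complete dataset and persist the pipeline
final_model = finalize_model(tuned_model)
save_model(final_model, 'titanic_pipeline')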
Regression Model
For this tutorial, we’ll use the Boston Housing dataset that is often used for machine learning exercises.
The Boston Housing dataset is a classic example used in the field of machine learning and data science. It is a dataset that contains information about various factors that influence the median value of owner-occupied homes in the suburbs of Boston. This dataset is often used to build regression models to predict the median value of owner-occupied homes based on different factors.
The dataset contains a total of 506 observations and 14 variables, including:
- CRIM: per capita crime rate by town.
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- NOX: nitric oxides concentration (parts per 10 million).
- RM: average number of rooms per dwelling.
- AGE: proportion of owner-occupied units built prior to 1940.
- DIS: weighted distances to five Boston employment centers.
- RAD: index of accessibility to radial highways.
- TAX: full-value property-tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- B: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.
- LSTAT: lower status of the population (percent).
- MEDV: median value of owner-occupied homes in $1000s.
The goal of machine learning models built using the Boston Housing dataset is to predict the median value of owner-occupied homes based on the other variables in the dataset. This dataset is often used as an introductory example for regression tasks, as it is relatively small and easy to understand, yet still presents some interesting challenges.
Step 1: Load dataset
import pandas as pd

# Load the Boston Housing dataset from a CSV file
boston = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
Step 2: Initialize PyCaret
Now we need to initialize PyCaret's regression module. We'll use the setup() function to set up our data for modeling.
from pycaret.regression import *

# Set up the data for modeling
reg = setup(data=boston, target='medv')
We pass the boston dataframe and the medv column as the target variable to setup(). This initializes PyCaret and sets up the environment for building a regression model.
Step 3: Compare models
Next, we can use the compare_models() function to compare the performance of several different regression models:
best_model = compare_models()
This will compare the performance of several different regression algorithms and return the best one based on a default evaluation metric (R2, in the case of regression). The best_model variable will contain the best performing model.
Step 4: Create a model
Now that we have selected the best model, we can retrain it using the create_model() function. This function trains the selected algorithm with cross-validation and returns the trained model object.
# Retrain the best performing model with cross-validation
tuned_model = create_model(best_model)
Step 5: Evaluate the model
Now we can evaluate the performance of our model using the evaluate_model() function. This will generate a report that contains several common regression evaluation metrics such as RMSE, MAE, R2, and MAPE.
# Evaluate the performance of the model
evaluate_model(tuned_model)
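As in the classification example, plot_model() gives individual, scriptable diagnostics; the plot names below are standard PyCaret regression plots:

# Residuals and prediction-error plots for the trained regressor
plot_model(tuned_model, plot='residuals')
plot_model(tuned_model, plot='error')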
Step 6: Make predictions
Finally, we can use our model to make predictions on new data using the predict_model() function. This function will take in a new dataset and return the predicted target value for each observation.
# Make predictions on new data
new_data = pd.DataFrame({
    'crim': [0.06, 0.07, 0.01],
    'zn': [20, 30, 40],
    'indus': [6.9, 7.7, 2.9],
    'chas': [0, 1, 0],
    'nox': [0.4, 0.5, 0.6],
    'rm': [6, 7, 8],
    'age': [55, 60, 70],
    'dis': [4, 5, 6],
    'rad': [4, 5, 6],
    'tax': [330, 350, 370],
    'ptratio': [18, 19, 20],
    'black': [380, 385, 390],
    'lstat': [5, 6, 7]
})

predictions = predict_model(tuned_model, data=new_data)
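A useful sanity check before scoring genuinely new data: if you call predict_model() without passing any data, PyCaret scores the hold-out set that setup() created, so you can see how the model performs on data it was not trained on.

# Score the hold-out split created by setup()
holdout_predictions = predict_model(tuned_model)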
Association Analysis
PyCaret is a powerful open-source machine learning library that offers a wide range of tools for various tasks, including association analysis. Association analysis is a technique used to identify relationships between variables in a dataset. It is often used in market basket analysis to identify patterns of co-occurrence between items in a transactional dataset. In this tutorial, we will use PyCaret to perform association analysis on a transactional dataset.
First, let’s load our dataset. We will be using the “Online Retail II” dataset, which contains transactional data from an online retailer based in the UK. You can download the dataset from the UCI Machine Learning Repository.
The Online Retail II dataset is a transactional dataset containing customer purchases from an online retailer based in the UK between 2009 and 2011. The dataset is publicly available and can be downloaded from the UCI Machine Learning Repository.
The dataset contains approximately 1 million transactions, with each transaction representing a purchase made by a customer. The data includes the following fields:
- InvoiceNo: A unique identifier for each transaction.
- StockCode: A unique identifier for each product.
- Description: A description of the product.
- Quantity: The quantity of the product purchased in the transaction.
- InvoiceDate: The date and time of the transaction.
- UnitPrice: The unit price of the product in pounds sterling.
- CustomerID: A unique identifier for each customer.
- Country: The country where the transaction was made.
The dataset is a useful resource for various tasks, including market basket analysis, customer segmentation, and sales forecasting. The data is also interesting because it contains a range of different product categories, including gifts, household items, and clothing.
It is worth noting that the dataset contains some missing values, which will need to be handled appropriately if the data is to be used for analysis or modeling. Additionally, the dataset includes transactions from customers located in various countries, but for some tasks, it may be necessary to filter the data to only include transactions from a specific country or region.
Step 1: Load dataset
import pandas as pd

# Load the dataset
data = pd.read_excel('Online Retail.xlsx')
Step 2: Preprocess data
Next, we need to preprocess our data by cleaning it and getting it into the right shape for association analysis. We will drop rows with missing values, remove returns (non-positive quantities), and restrict the data to transactions from the United Kingdom. For exploration, we will also pivot the data into a basket matrix in which each row represents a single transaction and each column represents an item.
# Clean the data
data = data.dropna()
data = data[data['Quantity'] > 0]
data = data[data['Country'] == 'United Kingdom']

# Transform the data into a basket matrix (one row per invoice, one column per item)
basket = (data.groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
In the code above, we first drop rows with missing values and remove transactions with a quantity of 0 or less. We then filter our data to only include transactions from the United Kingdom. Finally, we group the data by InvoiceNo and Description, sum the quantity of each item in each transaction, unstack so that each item becomes a column, fill missing values with 0, and set InvoiceNo as the index. The basket matrix is handy for exploring co-occurrence, as the short sketch below shows; PyCaret’s association rules module itself works directly on the cleaned transaction-level dataframe.
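If you want to look at the basket in the classic 0/1 market-basket form, you can binarize the quantities. This is plain pandas and purely for exploration; PyCaret does not require it:

# Convert quantities to a 0/1 indicator of whether an item appears in an invoice
basket_sets = (basket > 0).astype(int)
print(basket_sets.head())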
Step 3: Perform Association Analysis
Now that we have preprocessed our data, we can use PyCaret’s association rule mining module to perform association analysis.
from pycaret.arules import *

# Initialize the association rules module on the transaction-level data
s = setup(data=data, transaction_id='InvoiceNo', item_id='Description')

# Create association rules
rules = create_model()
In the code above, we first import PyCaret’s association rule mining module. We then initialize the module using the setup function, pointing it at the cleaned transactional data and naming the transaction ID and item ID columns. Finally, we mine association rules using the create_model function, which returns the rules as a pandas DataFrame.
Step 4: Model Inspection
We can now inspect the generated rules and use them for further analysis. Because create_model returns the rules as a DataFrame, ordinary pandas operations are enough; for example, we can sort by lift to surface the strongest associations.
# View the strongest rules, sorted by lift
print(rules.sort_values(by='lift', ascending=False).head(10))
Step 5: Query the rules
We can also query the rules to find out which items are frequently bought together with a specific product.
# Find rules whose antecedents contain a specific item
item = 'JUMBO BAG PINK POLKADOT'
item_rules = rules[rules['antecedents'].astype(str).str.contains(item)]
print(item_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
In the code above, we look up every rule whose antecedents contain the item “JUMBO BAG PINK POLKADOT”, which tells us which other items customers tend to purchase together with it.
Anomaly Detection
Anomaly detection is a technique used to identify rare or unusual observations in a dataset that may be of interest. In this tutorial, we will use PyCaret to perform anomaly detection on a dataset.
First, let’s load our dataset. We will be using the “Credit Card Fraud Detection” dataset, which contains credit card transactions that have been labeled as either fraudulent or non-fraudulent. You can download the dataset from Kaggle.
The Credit Card Fraud Detection dataset is a publicly available dataset that contains transactions made with credit cards in September 2013 by European cardholders. The dataset was collected and made available by Worldline and the Machine Learning Group of Université Libre de Bruxelles.
The dataset contains a total of 284,807 transactions, of which 492 (0.17%) are fraudulent. Each transaction contains the following fields:
- Time: The number of seconds elapsed between the first transaction in the dataset and the current transaction.
- V1 – V28: These are anonymized features obtained from a PCA transformation of the original features. These features represent numeric input variables that describe each transaction.
- Amount: The transaction amount.
- Class: A binary label indicating whether the transaction is fraudulent (1) or not (0).
The goal of this dataset is to build a machine learning model that can accurately detect fraudulent transactions based on the input features. The imbalanced nature of the dataset, with only a small fraction of transactions being fraudulent, makes this a challenging task.
This dataset is commonly used for machine learning research and is often used as a benchmark for evaluating the performance of different anomaly detection and fraud detection algorithms. It provides a useful resource for developing and testing new approaches to fraud detection, which is an important problem in many industries, including finance, e-commerce, and healthcare.
Step 1: Load dataset
import pandas as pd

# Load the dataset
data = pd.read_csv('creditcard.csv')
Step 2: Preprocess data
Next, we need to preprocess our data by cleaning and transforming it into the right format for anomaly detection. We will filter out rows with missing values, normalize our data, and split it into training and testing sets.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Clean the data
data = data.dropna()

# Normalize the data
scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data[['Amount']])

# Split the data into training and testing sets
train, test = train_test_split(data, test_size=0.2, random_state=42)
In the code above, we first drop rows with missing values. We then normalize the Amount column using the StandardScaler function from scikit-learn. Finally, we split our data into training and testing sets using the train_test_split function.
Step 3: Train model
Now that we have preprocessed our data, we can use PyCaret’s anomaly detection module to perform anomaly detection.
from pycaret.anomaly import *

# Initialize the anomaly detection module
s = setup(train, session_id=42)

# Create an anomaly detection model (an Isolation Forest)
model = create_model('iforest')
In the code above, we first import PyCaret’s anomaly detection module. We then initialize the module using the setup function and specify a session ID for reproducibility. Finally, we create an anomaly detection model, here an Isolation Forest, using the create_model function.
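create_model() in the anomaly module takes the ID of the detector to train, so other detectors can be tried in the same way, for example 'knn' or 'lof'. The sketch below also uses assign_model(), which appends the anomaly labels and scores to the training data:

# Train an alternative detector and label the training data
knn_detector = create_model('knn')
labeled_train = assign_model(knn_detector)
print(labeled_train.head())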
Step 4: Perform Predictions
We can now use the generated model to make predictions and perform further analysis. For example, we can use the predict_model function to make predictions on our testing set and calculate performance metrics.
# Make predictions on the testing set
predictions = predict_model(model, data=test)

# Calculate performance metrics
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(predictions['Class'], predictions['Anomaly']))
print(classification_report(predictions['Class'], predictions['Anomaly']))
In the code above, we first use the predict_model function to make predictions on our testing set. We then calculate performance metrics using the confusion_matrix and classification_report functions from scikit-learn.
Clustering
Clustering is a technique used in unsupervised learning to group similar objects together based on their features. PyCaret’s clustering module provides a variety of clustering algorithms that can be used for this purpose. In this tutorial, we will use the “Iris” dataset to demonstrate how to perform clustering using PyCaret.
The Iris dataset is a classic dataset in machine learning and statistics, and it is often used as a benchmark dataset for classification problems. The dataset contains information about different species of iris flowers, and the goal is to predict the species of a flower based on its measurements.
The dataset contains a total of 150 samples, with 50 samples for each of the three iris species: setosa, versicolor, and virginica. Each sample contains the following four features:
- Sepal Length: The length of the sepal (in cm).
- Sepal Width: The width of the sepal (in cm).
- Petal Length: The length of the petal (in cm).
- Petal Width: The width of the petal (in cm).
The dataset is often used for exploring and visualizing data, as well as for testing and comparing different classification algorithms. Because the dataset is relatively small and easy to work with, it is a popular choice for beginners learning about machine learning.
The Iris dataset is available in many machine learning libraries, including scikit-learn and PyCaret, making it easy to access and use for a wide range of applications.
Step 1: Load dataset
First, let’s load the dataset:
from pycaret.datasets import get_data

data = get_data('iris')
In the code above, we use PyCaret’s get_data function to load the “Iris” dataset.
Step 2: Setup clustering model
Next, let’s set up our clustering experiment using PyCaret’s setup function:
from pycaret.clustering import setup

clustering = setup(data, normalize=True, session_id=123)
In the code above, we use the setup function to preprocess our data and set up our clustering experiment. We specify normalize=True to normalize our data, and session_id=123 to ensure reproducibility.
Step 3: Create a model
Now that we have set up our experiment, we can train a clustering model using PyCaret’s create_model function; here we use k-means:
from pycaret.clustering import create_model

# Train a k-means clustering model
kmeans = create_model('kmeans')
In the code above, we use the create_model function to train a k-means model on the preprocessed data. Other algorithms, such as 'hclust' for hierarchical clustering or 'dbscan', can be created in the same way by passing their model ID.
Step 4: Predict cluster labels
We can now use the trained model to assign cluster labels to our data:
from pycaret.clustering import predict_model

predictions = predict_model(kmeans, data=data)
In the code above, we use the predict_model function to assign a cluster label to each observation using the k-means model trained above.
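If you only need cluster labels for the data that was passed to setup(), assign_model() is a convenient shortcut: it returns the original dataframe with the cluster assignment appended. A short sketch using the k-means model created above:

from pycaret.clustering import assign_model

# Append cluster assignments to the data used in setup()
clustered_data = assign_model(kmeans)
print(clustered_data.head())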
Step 5: Visualize the model
We can also visualize the results of our clustering using PyCaret’s plot_model function:
from pycaret.clustering import plot_model

plot_model(kmeans, plot='elbow')
In the code above, we use the plot_model function to draw the elbow plot for our clustering algorithm, which helps us determine the optimal number of clusters.
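Besides the elbow plot, plot_model() supports several other clustering visualizations; the plot names below are standard options in PyCaret's clustering module:

# Additional clustering visualizations
plot_model(kmeans, plot='silhouette')
plot_model(kmeans, plot='tsne')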
Natural Language Processing (NLP)
PyCaret’s NLP module provides a wide range of tools for preprocessing and analyzing text data, including feature extraction, dimensionality reduction, and topic modeling. In this tutorial, we will use the “kiva” dataset to demonstrate how to perform NLP using PyCaret.
The Kiva dataset is a collection of loan descriptions and other data from the Kiva microfinance platform, which provides loans to individuals and small businesses in developing countries. The dataset contains information about the borrowers, the lenders, the loan amounts, and the loan descriptions, which are written by the borrowers themselves.
The goal of the dataset is to enable analysis and visualization of the loan descriptions, as well as to explore the relationship between the borrowers and the lenders. The dataset contains a total of 681,207 loan descriptions, which are written in a variety of different languages, including English, Spanish, French, and Arabic.
Each loan description in the dataset contains the following features:
- Loan ID: A unique identifier for the loan.
- Loan Description: A text description of the loan, written by the borrower.
- Loan Amount: The amount of the loan, in US dollars.
- Country: The country where the loan is being made.
- Sector: The sector of the economy where the borrower works.
- Activity: The specific activity that the borrower is engaged in.
- Currency: The currency of the loan.
- Posted Date: The date when the loan was posted on the Kiva platform.
- Funded Date: The date when the loan was fully funded by lenders.
- Term in Months: The length of the loan term, in months.
- Repayment Interval: The frequency of loan repayments (e.g., weekly, monthly).
- Lender Count: The number of lenders who contributed to the loan.
- Loan Status: The current status of the loan (e.g., funded, expired, defaulted).
The Kiva dataset is often used for exploring and visualizing text data, as well as for testing and comparing different natural language processing (NLP) algorithms. Because the dataset is relatively large and diverse, it is a popular choice for researchers and data scientists working in the fields of microfinance and NLP.
Step 1: Load dataset
First, let’s load the dataset:
from pycaret.datasets import get_data

# load dataset
data = get_data('kiva')
In the code above, we use PyCaret’s get_data function to load the “Kiva” dataset, which contains loan descriptions from the Kiva microfinance platform.
Step 2: Setup experiment
Next, let’s set up our NLP experiment using PyCaret’s setup function:
from pycaret.nlp import setup

nlp = setup(data, target='en', session_id=123)
In the code above, we use the setup function to preprocess our text data and set up our NLP experiment. We specify target='en' to point PyCaret at the column containing the English loan descriptions, and session_id=123 to ensure reproducibility.
Step 3: Create model
Now that we have set up our experiment, we can extract topics from our text data using PyCaret’s create_model function:
from pycaret.nlp import create_model

lda = create_model('lda')
In the code above, we use the create_model function to train a Latent Dirichlet Allocation (LDA) model, a popular topic modeling algorithm, on our text data.
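create_model() in the NLP module also lets you choose how many topics to extract via the num_topics argument. The call below is a sketch with an arbitrary choice of six topics:

# Train an LDA model with a chosen number of topics
lda_six_topics = create_model('lda', num_topics=6)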
Step 4: Assign topics
We can now attach the extracted topics to the documents in our dataset using PyCaret’s assign_model function:
from pycaret.nlp import assign_model

# Assign topic proportions and the dominant topic to each document
lda_results = assign_model(lda)
print(lda_results.head())
In the code above, we use the assign_model function to append the topic weights and the dominant topic for each loan description to our original dataset.
Step 5: Visualize results
We can also visualize the topics extracted from our text data using PyCaret’s plot_model function:
from pycaret.nlp import plot_model

plot_model(lda, plot='topic_model')
In the code above, we use the plot_model function to plot the topics extracted from our text data using the LDA model.
PyCaret’s NLP module provides a powerful and flexible set of tools for analyzing text data, and it makes it easy to extract features and make predictions using a variety of different algorithms.
Model Explainability
PyCaret provides a range of tools for model explainability that can help data scientists and machine learning practitioners understand how their models make predictions and which features are most important for those predictions. Here’s an example of using PyCaret for model explainability on the diabetes dataset.
The diabetes dataset bundled with PyCaret is the well-known Pima Indians Diabetes dataset, a frequently used benchmark in machine learning and data science. It contains medical data from 768 female patients of Pima Indian heritage, including features such as the number of pregnancies, plasma glucose concentration, blood pressure, skin fold thickness, serum insulin, body mass index (BMI), diabetes pedigree function, and age. The target variable, Class variable, indicates whether the patient developed diabetes.
The data was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases, and it has become a popular choice for machine learning practitioners due to its size, completeness, and relevance to real-world problems.
The dataset is often used as a benchmark for binary classification models and is frequently used as a teaching dataset in machine learning courses and workshops, due to its accessibility and well-documented nature.
Overall, the diabetes dataset is a valuable resource for data scientists and machine learning practitioners looking to develop and test classification models.
# Load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# Create a model using PyCaret
from pycaret.classification import *
clf = setup(data, target='Class variable')
model = create_model('rf')  # a tree-based model, which SHAP's TreeExplainer supports

# Explain model predictions using SHAP values via interpret_model()
interpret_model(model, plot='summary')      # overall feature impact
interpret_model(model, plot='correlation')  # SHAP dependence plot
In this example, we first load the diabetes dataset using the get_data() function from PyCaret’s built-in datasets. We then create a random forest model using the setup() and create_model() functions from PyCaret’s classification module.
To explain the model predictions, we use PyCaret’s interpret_model() function, which calculates SHAP (SHapley Additive exPlanations) values; these provide a way to understand the importance of each feature for each individual prediction. Note that SHAP-based interpretation in PyCaret is designed for tree-based models, which is why we created a random forest rather than a linear model.
In this example, we call interpret_model() with plot='summary' to create a summary plot of the SHAP values, which shows the overall impact of each feature on the model predictions. We also call it with plot='correlation' to create a SHAP dependence plot, which shows how the model predictions change as the value of a given feature changes.
By using PyCaret’s built-in tools for model explainability, we can gain insights into how our model is making predictions and which features are most important for those predictions. This can help us to identify potential issues with our model and to improve its accuracy and performance.
Conclusion
In conclusion, PyCaret is a powerful and user-friendly machine learning library that simplifies the process of building and deploying machine learning models. It provides a range of tools for data preprocessing, feature engineering, model selection, and model evaluation, as well as a variety of built-in algorithms and models. With PyCaret, data scientists and machine learning practitioners can quickly build and deploy accurate and robust models without having to write complex code or spend a lot of time on model development. PyCaret is a valuable addition to any data scientist’s toolkit and has quickly become a popular choice among machine learning practitioners due to its simplicity, flexibility, and ease of use.