What Is PyCaret?
PyCaret is an open-source, low-code machine learning library in Python, inspired by the caret package in R. Its goal is to use minimal code and minimal assumptions to gain insights during a cycle of machine learning exploration and development. With the help of this library, we can quickly and effectively carry out end-to-end machine learning experiments.
Why Use PyCaret?
Using this library requires minimal programming to run any machine learning experiment, and it lets us carry out sophisticated experiments in a flexible manner. Operations performed with this library are automated and stored in the PyCaret pipeline, which is fully orchestrated end to end. The library also lets us move quickly from data analysis to model building and deployment, and it helps automate a variety of tasks, such as encoding categorical data, imputing missing values, engineering features, and tuning model hyperparameters.
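For context, here is a rough sketch of what a couple of those preprocessing steps look like when done by hand with pandas (the toy frame and column names are hypothetical; PyCaret automates this kind of work for you):

```python
import pandas as pd

# A tiny toy frame with a missing numeric value and a categorical column
df = pd.DataFrame({
    "age": [25, None, 40],
    "grade": ["A", "B", "A"],
})

# Impute the missing numeric value with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["grade"])
```

With PyCaret, steps like these are configured once in setup() and replayed consistently on new data instead of being hand-written per experiment.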
We will work through a classification use case with PyCaret on the Default of Credit Card Clients dataset from Kaggle, in order to predict whether a customer will default. This prediction will be based on a number of features that we'll examine in this tutorial.
- Jupyter Notebook or Visual Studio with Jupyter extension
- Python 3.6+ version
- PyCaret (the latest version and release notes can be found here)
As a first step in our process, we'll install the libraries required to perform our operations. Our dependencies include:

PyCaret
This will be our main workhorse, enabling us to leverage ML pipelines for end-to-end execution.
Pandas is an open-source software library built on top of the
Python programming language for analyzing, cleansing, exploring, and manipulating data.
Let’s now include these dependencies using the pip commands.
pip install pycaret pandas
In the next step, we will import the dependencies to perform various operations.
import pandas as pd
from pycaret.classification import *
Loading The Data Set
The credit card dataset we will use can be downloaded from here. Download it into the folder you're working in.
Once the download is finished, we can load the dataset with pandas:
df = pd.read_csv('UCI_Credit_Card.csv')
Here, we're assigning the loaded dataset to a variable called df (short for data frame), the naming convention generally used by developers. You are free to use your own variable name if you prefer.
To confirm that the dataset has loaded, and to view it, we'll use df.head().
Python's head method shows the top five rows of the data frame by default. It accepts a single optional parameter, the number of rows, which lets us specify how many rows to display.
df.tail() shows the last five rows.
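On a small toy frame (a stand-in for the credit card data), the two calls behave like this:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

df.head()    # first 5 rows by default
df.head(3)   # first 3 rows
df.tail(2)   # last 2 rows
```

Both methods return a new DataFrame, so they can be chained with other pandas operations.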
Getting Data Set - Method 2
We can also grab curated datasets from the PyCaret data repository, which consists of fifty-six popular datasets for various use cases such as classification, regression, clustering, NLP, etc. Note that this method cannot be used for your own dataset. You can view the list and specifications of these datasets with just two lines of code.
from pycaret.datasets import get_data
all_datasets = get_data('index')
The sixth entry is our credit card dataset, which can be loaded with:
dataset_name = 'credit'
data = get_data(dataset_name)
Model Training And Evaluation
We will use the
setup() function to train and evaluate our model.
PyCaret automates multiple time-consuming preprocessing tasks by standardizing basic data-preparation procedures and assembling them into repeatable, time-saving workflows. Users can automate cleaning (for example, handling missing values with one of the available imputation methods), splitting the data into train and test sets, some aspects of feature engineering, and training. While many of the artifacts produced during this process (such as test sets or label vectors) aren't explicitly displayed to the user, more seasoned practitioners can still access them if needed.
clf = setup(data = credit_dataset, target='default', session_id=123)
The above code prints information about the preprocessing pipeline that is enabled when setup() runs.
We also used the session_id=123 option in our experiment, to achieve reproducibility. It is not required; if it is omitted, a random number is generated instead.
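The idea is the same as fixing a random seed. A minimal illustration with NumPy (not PyCaret itself) of why a fixed seed makes runs repeatable:

```python
import numpy as np

# Two generators seeded with the same value produce identical draws,
# which is the kind of repeatability session_id gives PyCaret runs.
rng1 = np.random.default_rng(123)
rng2 = np.random.default_rng(123)

a = rng1.random(3)
b = rng2.random(3)
# a and b are element-for-element identical
```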
Now the stage is set for the final show, which is our model training.
best_model = compare_models()
The above code will train our model and it is the
compare_models() function that does the magic for us.
This function evaluates every model in the model library using cross-validation over standard classification metrics (such as Accuracy, AUC, Recall, Precision, F1, and Kappa). The outcome is a ranked list of the top-performing models at that moment.
Ridge Classifier is our best-performing model in this instance. Various learning algorithms are included in the list. However, the learning algorithm that performs the best is the only one in which we are interested. We let the others go.
The test accuracy after evaluating the model is 0.8159, which is comparable to what we got in training. Still, the slight decline may point to some overfitting that should be inspected. The issue can be tamed with techniques such as early stopping, narrowing the gap between train and test results.
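As a generic illustration (not PyCaret's API), early stopping halts training once the validation loss stops improving for a set number of epochs. A minimal sketch with a hypothetical loss curve:

```python
# Hypothetical validation losses per epoch; the curve improves,
# bottoms out, then starts rising as the model begins to overfit.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60]

def early_stop_epoch(losses, patience=2):
    """Return the epoch with the best loss, stopping once the loss
    has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

stop = early_stop_epoch(val_losses)  # picks epoch 3 (loss 0.55)
```

Gradient-boosting and neural-network trainers expose this as a built-in option; the mechanism is the same monitoring loop shown above.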
Prediction On Dataset
Next, we’ll carry out a prediction on the credit data set that we run:
prediction = predict_model(best_model, data = credit_dataset)
prediction.tail()
You can now see that a new column has been appended at the end of the dataset, which holds our prediction.
Value 1 stands for True (the customer will default), and
value 0 stands for False (the customer will not default). We can use the head and tail functions as we wish; for the complete prediction, just remove the tail() call.
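To make the output more readable, the appended label column can be mapped to descriptive values. Note that the column name depends on your PyCaret version ('Label' in 2.x, 'prediction_label' in 3.x); the frame below is a stand-in for the real prediction output:

```python
import pandas as pd

# Stand-in for PyCaret's prediction output; check your version for
# the actual column name ('Label' in 2.x, 'prediction_label' in 3.x).
prediction = pd.DataFrame({"Label": [1, 0, 1]})

prediction["will_default"] = prediction["Label"].map(
    {1: "default", 0: "no default"}
)
```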
The concluding step in this end-to-end ML pipeline is to save the model to our local system. For that, run:
save_model(best_model, 'ridge-model')
Enter the code below to load our stored model:
model = load_model('ridge-model')
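Under the hood this is ordinary model serialization. A generic sketch of the same save/load round trip using Python's pickle module (the dictionary here is a stand-in for a real trained model object):

```python
import os
import pickle
import tempfile

# Stand-in "model": any picklable Python object round-trips this way
model = {"coef": [0.1, 0.2], "intercept": -1.3}

path = os.path.join(tempfile.gettempdir(), "ridge-model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)        # save to disk

with open(path, "rb") as f:
    restored = pickle.load(f)    # load it back
```

PyCaret's save_model and load_model wrap this pattern and also persist the entire preprocessing pipeline alongside the estimator, which is why the reloaded model can score raw data directly.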
Voila! In fewer than twenty lines of code, we executed an entire ML classification pipeline.
PyCaret is well on its way to becoming one of the most loved packages in the data ecosystem; new features and integrations are released regularly. It saves a lot of time, and its ease of use helps democratize the ML and data science life cycle to a great extent. Mastering PyCaret, along with libraries such as PyTorch, will definitely add value to the arsenal of every data scientist.