
Parallelisation of Model Evaluation and Hyperparameter Tuning in Scikit-Learn


Hello, it is me again for another post on how to make scikit-learn perform at the top of its game.

Amping up the Model Evaluation process

Model evaluation in scikit-learn can be achieved with the cross_val_score function. This performs repeated stratified K-fold resampling and assesses the model's accuracy across each of those folds, giving a better sense of how the model will perform on unseen data than a simple hold-out split, which is only a one-off sample of the model's accuracy in the wild. Normally, these resampled scores are averaged. However, I am coming at this from the point of optimising the speed of the resamples.
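To make the "normally averaged" point concrete, here is a minimal sketch (using the same scikit-learn pieces this post relies on) of taking the mean and spread of the resampled scores:

# Minimal sketch: evaluate with repeated stratified K-fold and average the scores
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, Y = make_classification(n_samples=1000, n_features=20)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
scores = cross_val_score(RandomForestClassifier(), X, Y, scoring="accuracy", cv=cv)
print("Mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))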

Bring in the necessary imports and available CPUs

This part of the code will look similar to the post on optimising model training:

from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import os
import matplotlib.pyplot as plt


# Get CPU cores
cpu_count = os.cpu_count()
print(f"This machine has {cpu_count} cores")
reserved_cpu = 1
final_cpu = int(cpu_count - reserved_cpu)
print("Saving one CPU so my PC does not lock up")

Please refer to the previous post for an explanation of what this is doing, as the CPU count logic is detailed there.
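As an aside, scikit-learn estimators also accept n_jobs=-1, which means "use every available core"; reserving one manually, as above, just trades a little speed for a responsive machine:

# n_jobs=-1 asks scikit-learn to use all available cores;
# the manual reservation above keeps the desktop usable while the job runs
model_all_cores = RandomForestClassifier(n_estimators=100, n_jobs=-1)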

Create the dataset and fit an ML classifier

This is the same as before, although we are using a smaller dataset; because the evaluation involves resampling, training time can grow rapidly with dataset size. Creating the dataset:

#------------------------------------------------------------------------------
# Create the dataset
#------------------------------------------------------------------------------
X, Y = make_classification(n_samples=1000,
                           n_features=20, 
                           n_informative=15,
                           n_redundant=5)

model = RandomForestClassifier(n_estimators=100, n_jobs=1)

I will fit a random forest with 100 decision trees. Note that I do not enable parallelism on the model at this stage (n_jobs=1).
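Keeping n_jobs=1 here is deliberate: parallelism can live either inside the estimator or inside the resampling loop, and running both at once risks oversubscribing the cores. A sketch of the two options (this post parallelises the resampling):

# Option 1: parallelise the forest itself, keep the resampling serial
forest_parallel = RandomForestClassifier(n_estimators=100, n_jobs=final_cpu)
# Option 2 (used below): keep the forest serial and hand the workers
# to cross_val_score via its n_jobs argument instead
forest_serial = RandomForestClassifier(n_estimators=100, n_jobs=1)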

Create the function for optimising the cross validation scoring

The function hereunder defines the evaluation procedure (a repeated stratified K-fold with 10 splits and 3 repeats), times the call to cross_val_score with the requested number of workers, and returns both the run time and the fold scores:

#------------------------------------------------------------------------------
# Build function to optimise cv_score
#------------------------------------------------------------------------------
def cross_val_optimised(model, X, Y, workers):
    # Define the evaluation procedure
    strat_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
    # Start the timer
    start = time()
    model_result = cross_val_score(model, X, Y, scoring="accuracy",
                    cv=strat_cv, n_jobs=workers)
    end = time()
    result = end - start
    print("[MODEL EVALUATION INFO] run time is {:.3f} seconds".format(result))
    return [result, model_result]
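A single call, as a usage sketch, would look like this; the mean of the returned fold scores gives the headline accuracy:

# Example call: time the evaluation using all reserved workers
elapsed, fold_scores = cross_val_optimised(model, X, Y, workers=final_cpu)
print("Mean accuracy across folds: {:.3f}".format(fold_scores.mean()))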

Benchmark our function with multiple workers (cores)

Finally, we are going to call the function once for each core count in a loop. This time I have used a list comprehension to build the core counts dynamically, producing a range of values from 1 up to the number of CPUs available on your machine:

#------------------------------------------------------------------------------
# Loop each call of the function and append the result to a list
#------------------------------------------------------------------------------    
results_list = list()
# Create a list with number of cores
cores = [core for core in range(1, final_cpu+1)]
# Get the cores list dynamically, as previous example was hard coded
for n_cor in cores:
    cv_optim = cross_val_optimised(model, X, Y, workers=n_cor)
    result = cv_optim[0]
    results_list.append(result)

Running this code kicks off the benchmark, and the run-time message from the function is printed to the console for each worker count.
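If you want the raw numbers as well as the plot that follows, here is a small sketch pairing each core count with its measured run time:

# Print a simple cores-versus-run-time table from the benchmark
for n_cor, run_time in zip(cores, results_list):
    print("{} core(s): {:.3f} seconds".format(n_cor, run_time))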

Visualising the results

The last step is to work with matplotlib. This uses the function we created in the first tutorial, contained below for reference:

# Generate plot of results
def plot_results(x_val, y_val):
   plt.plot(x_val, y_val, color="blue", linestyle="--", marker="o")
   plt.xlabel("Number of cores")
   plt.ylabel("Run time (secs)")
   plt.show() 

To use this function:

  
plot_results(cores, results_list)

This produces the plot illustrated hereunder:

With 7 workers performing the repeated cross-validation, the process completes far more quickly than it does on a single core.
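To put a figure on the improvement, the speed-up factor falls straight out of the benchmark list (this assumes results_list is ordered from 1 core upwards, as in the loop above):

# Speed-up of the largest worker count relative to a single worker
speedup = results_list[0] / results_list[-1]
print("{} workers ran {:.1f}x faster than 1 worker".format(cores[-1], speedup))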

The code for this section can be found on the associated GitHub.

Tuning those hyperparameters faster

This part of the tutorial looks at hyperparameter tuning. We are going to tune the max_features hyperparameter of the random forest. This process, especially with custom grid searches, can be optimised and parallelised effectively. Here is how to become a parallel whizz-kid!

Setup, dataset creation and model training

Here, we will use the same settings and classification created in the above tutorial on model evaluation.

Setting up the Grid Search

This code will set up the grid search dictionary, as hyperparameter grids in scikit-learn are passed as dictionary data structures:

#------------------------------------------------------------------------------
# Train model on max number of CPUs
#------------------------------------------------------------------------------
rf_model = RandomForestClassifier(n_estimators=100, n_jobs=final_cpu)
#------------------------------------------------------------------------------
# Grid search hyperparameter tuning
#------------------------------------------------------------------------------
# Do a K fold cross validation split
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
grid = dict()
grid['max_features'] = [1,2,3,4,5]

We create the cv variable that will be used for the resampling method. The grid is initialised as an empty dict() structure. I then add a key called max_features to the grid; this hyperparameter controls how many features each tree in the random forest considers at each split.
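The same dictionary can carry more than one hyperparameter, though every extra key multiplies the number of candidate fits. The wider grid below is purely a hypothetical illustration; this tutorial only tunes max_features:

# Hypothetical wider grid: 5 x 2 x 2 = 20 candidates, each fitted
# across every resample, so the search cost multiplies quickly
wider_grid = dict()
wider_grid['max_features'] = [1, 2, 3, 4, 5]
wider_grid['n_estimators'] = [100, 200]
wider_grid['max_depth'] = [None, 10]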

Create the grid search function

The next step is to create our grid search function that will fit the model, perform tuning and time the whole process:

from sklearn.model_selection import GridSearchCV

def grid_search_cv(model, grid, n_jobs, cv):
    print("[GRID SEARCH INFO] Starting the grid search")
    # Build the search over the supplied grid, parallelised across n_jobs workers
    grid_search = GridSearchCV(model, grid, n_jobs=n_jobs, cv=cv)
    start = time()
    grid_search_fit = grid_search.fit(X, Y)
    end = time()
    finish_time = end - start
    print("[GRID SEARCH EXECUTION TIME] the search finished in: {:.3f} seconds.".format(finish_time))
    return [finish_time, grid_search, grid_search_fit]

To explain what the function is doing: it builds a GridSearchCV object from the model, the hyperparameter grid, the number of parallel workers and the resampling strategy; it then times the call to fit on X and Y, prints the execution time, and returns the run time alongside the search object and the fitted result.
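As a usage sketch, a single call followed by inspection of the winning combination might look like this; GridSearchCV exposes best_params_ and best_score_ once fitted:

# Example call on all reserved workers, then inspect the best result
run_time, search, fitted = grid_search_cv(rf_model, grid=grid, n_jobs=final_cpu, cv=cv)
print("Best parameters: {}".format(fitted.best_params_))
print("Best mean accuracy: {:.3f}".format(fitted.best_score_))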

Performance comparison by the number of cores

We will now replicate the benchmarking steps we previously coded:

#------------------------------------------------------------------------------
# Benchmarking run time by depth of search and number of worker ants
#------------------------------------------------------------------------------
results_list = list()
# Create a list with number of cores
cores = [core for core in range(1, final_cpu+1)]
# Get the cores list dynamically, as previous example was hard coded
for n_cor in cores:
    search = grid_search_cv(rf_model, grid=grid, n_jobs=n_cor, cv=cv)
    result = search[0]
    results_list.append(result)

This performs the same routine as we have coded before; the only difference is that it calls the grid search and runs it with a different number of worker ants each time.
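As a small addition (not in the original timing code), you can report the best-performing worker count directly from the benchmark results:

# Identify the worker count with the shortest grid search run time
best_core = cores[results_list.index(min(results_list))]
print("Fastest grid search used {} worker(s)".format(best_core))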

Visualising the Grid Search results

I am not going to reimplement the function we created at the start of the tutorial, but I am going to use it for this example:

plot_results(cores, results_list)

This gives you the plot of the grid tuning process:

The tutorial code for this part can be found on the associated GitHub.

To end this parallel madness

If you have a use case you need help with, just reach out; I would be interested to help.

I hope you enjoyed these two tutorials.
