Hello, it is me again with another post on how to make scikit-learn perform at the top of its game.
Amping up the Model Evaluation process
Model evaluation in scikit-learn can be carried out with cross_val_score. This performs repeated stratified K fold resampling and assesses the model's accuracy across each of those folds, which gives a better sense of how the model will behave on unseen data than a simple hold-out split, as a hold-out is a one-off sample of the accuracy the model would achieve in the wild. Normally these resampled scores are averaged. However, I am coming at this from the angle of optimising the speed of the resamples.
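Before diving in, here is a minimal sketch (not part of the tutorial code, using a small toy dataset of my own choosing) that contrasts a one-off hold-out estimate with a cross-validated one:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, Y = make_classification(n_samples=500, n_features=10, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# One-off hold-out estimate - a single draw of the test data
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=42)
holdout_acc = clf.fit(X_tr, Y_tr).score(X_te, Y_te)

# Resampled estimate - averaged over 5 folds
cv_scores = cross_val_score(clf, X, Y, cv=5, scoring="accuracy")

print(f"Hold-out accuracy: {holdout_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")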
Bring in the necessary imports and available CPUs
This part of the code will look similar to the post on optimising model training:
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import os
import matplotlib.pyplot as plt

# Get CPU cores
cpu_count = os.cpu_count()
print(f"This machine has {cpu_count} cores")
reserved_cpu = 1
final_cpu = int(cpu_count - reserved_cpu)
print("Saving one CPU so my PC does not lock up")
Please refer to the previous post for what this is doing, as the CPU count logic is detailed there.
Create the dataset and fit a ML classifier
This is the same as before, albeit with a smaller dataset, because the repeated resampling multiplies the training time. Creating the dataset:
#------------------------------------------------------------------------------
# Create the dataset
#------------------------------------------------------------------------------
X, Y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5)
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
I will be fitting a Random Forest with 100 decision trees, and I do not enable parallelism at the model level at this stage (n_jobs=1).
Create the function for optimising the cross validation scoring
The function hereunder will be explained in depth under the code block:
#------------------------------------------------------------------------------
# Build function to optimise cv_score
#------------------------------------------------------------------------------
def cross_val_optimised(model, X, Y, workers):
    # Define the evaluation procedure
    strat_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
    # Start the timer
    start = time()
    model_result = cross_val_score(model, X, Y, scoring="accuracy", cv=strat_cv, n_jobs=workers)
    end = time()
    result = end - start
    print("[MODEL EVALUATION INFO] run time is {:.3f}".format(result))
    return [result, model_result]
- strat_cv – states that we want to undertake repeated stratified K fold resampling – n_splits is the K parameter, so I want 10 folds, and n_repeats says I want to repeat the whole 10-fold procedure 3 times, giving 30 resamples in total
- start – the current time, stored when the variable is assigned
- model_result – this is the main part of the function and the call we are going to run in parallel:
- model parameter – this is the model we fitted prior
- X – the independent variables dataset, or array
- Y – the dependent variable array
- cv – this is the strat_cv method
- n_jobs = workers – this will be what we iterate over to look at the performance of the models
- end – the end time
- result – how long the model ran in seconds. For long processes this might need to be converted to minutes/hours to provide more sensible feedback
- The function prints out the run time of the model
- The return statement outputs a list containing the result, i.e. how long the evaluation took to run, and model_result, which is the array of accuracy scores from the resamples (see the sketch after this list)
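As a quick illustration (my own addition, assuming the blocks above have been run), the resampler yields 10 x 3 = 30 folds and model_result holds the 30 accuracy scores, which you would typically summarise with their mean and standard deviation:

# Assumes model, X, Y and cross_val_optimised() from the blocks above
strat_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
print(strat_cv.get_n_splits(X, Y))  # 30 resamples in total

run_time, model_result = cross_val_optimised(model, X, Y, workers=1)
print("Mean accuracy: {:.3f} (+/- {:.3f})".format(model_result.mean(), model_result.std()))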
Benchmark our function with multiple workers (cores)
Finally, we are going to call the function on each pass through the cores. This time I have used a list comprehension to make this dynamic, as it produces a range of values from 1 up to the number of CPUs available on your machine:
#------------------------------------------------------------------------------
# Loop each call of the function and append the result to a list
#------------------------------------------------------------------------------
results_list = list()
# Create the cores list dynamically, as the previous example was hard coded
cores = [core for core in range(1, final_cpu+1)]

for n_cor in cores:
    cv_optim = cross_val_optimised(model, X, Y, workers=n_cor)
    result = cv_optim[0]
    results_list.append(result)
- results_list – an empty list initialised
- cores – list comprehension through a range of values from 1 to the number of CPUs we have on our machine
- The loop:
- creates an object called cv_optim – the result of calling our function on each iteration, with the number of workers increasing by one each time
- result – this is the run time from the list at index 0
- The final result gets appended to the empty list until the loop terminates
Running this code kicks off the benchmark and, for each worker count, the run time message from the function is printed to the console.
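If you would also like a compact summary once the loop finishes, a small sketch along these lines (my addition, not in the original script) pairs each core count with its recorded run time:

# Pair each core count with its recorded run time after the benchmarking loop
for n_cor, run_time in zip(cores, results_list):
    print(f"{n_cor} core(s): {run_time:.3f} seconds")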
Visualising the results
The last step is to work with matplotlib. This uses the function we created in the first tutorial, contained below for reference:
# Generate plot of results
def plot_results(x_val, y_val):
    plt.plot(x_val, y_val, color="blue", linestyle="--", marker="o")
    plt.xlabel("Number of cores")
    plt.ylabel("Run time (secs)")
    plt.show()
To use this function:
plot_results(cores, results_list)
This produces the plot illustrated hereunder:
With 7 workers performing the repeated cross-validation, the whole process completes much more quickly.
The code for this section can be found on the associated GitHub.
Tuning those hyperparameters faster
This part of the tutorial looks at hyperparameter tuning. We are going to tune the max_features hyperparameter of the random forest. This process, especially via custom grid searches, can be parallelised very effectively. Here is how to become a parallel whizz-kid!
Setup, dataset creation and model training
Here, we will use the same settings and classification created in the above tutorial on model evaluation.
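For reference, the recap below is lifted from the earlier blocks, so this part of the tutorial can be run on its own:

from time import time
import os
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
import matplotlib.pyplot as plt

# Keep one core back so the machine stays responsive
cpu_count = os.cpu_count()
reserved_cpu = 1
final_cpu = int(cpu_count - reserved_cpu)

# Same classification dataset as the model evaluation tutorial
X, Y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5)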
Setting up the Grid Search
This code sets up the grid search dictionary, as hyperparameter grids in scikit-learn are passed as dictionary data structures:
#------------------------------------------------------------------------------
# Train model on max number of CPUs
#------------------------------------------------------------------------------
model = RandomForestClassifier(n_estimators=100, n_jobs=final_cpu)

#------------------------------------------------------------------------------
# Grid search hyperparameter tuning
#------------------------------------------------------------------------------
# GridSearchCV is not in the earlier imports, so bring it in here
from sklearn.model_selection import GridSearchCV

# Do a K fold cross validation split
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
grid = dict()
grid['max_features'] = [1, 2, 3, 4, 5]
We create the cv variable that will be used for the resampling method. The grid is initialised as an empty dict() structure. I add a key called max_features to the grid, which controls how many features the random forest considers when looking for the best split at each node.
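As a sanity check (my own addition), you can work out how many model fits each grid search will trigger: five candidate values of max_features multiplied by the 30 resamples from the repeated stratified K fold:

# Number of candidate settings x number of resamples = fits per search
n_candidates = len(grid['max_features'])  # 5 values of max_features
n_resamples = cv.get_n_splits(X, Y)       # 10 splits x 3 repeats = 30
print(f"Each grid search fits the model {n_candidates * n_resamples} times (plus one final refit on the best setting)")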
Create the grid search function
The next step is to create our grid search function that will fit the model, perform tuning and time the whole process:
def grid_search_cv(model, grid, n_jobs, cv):
    print("[GRID SEARCH INFO] Starting the grid search")
    grid_search = GridSearchCV(model, grid, n_jobs=n_jobs, cv=cv)
    start = time()
    grid_search_fit = grid_search.fit(X, Y)
    end = time()
    finish_time = end - start
    print("[GRID SEARCH EXECUTION TIME] the search finished in: {:.3f} seconds.".format(finish_time))
    return [finish_time, grid_search, grid_search_fit]
To explain what the function is doing:
- the first step is to print out that the grid search has started
- grid_search – this is the driver of the search; it takes the model and the grid we created in the previous step to tune max_features in the random forest (the grid would change depending on the tuning routine)
- start – the time stamp taken just before the search begins
- grid_search_fit – fit the grid_search method to the features and the predicted variable
- end – the final time stamp at the end of the process
- finish_time – the elapsed time when running the routine
- prints out the time of execution to the console
- returns the finish_time in seconds, the grid_search object and the fitted grid_search_fit object, stored in a list structure (see the sketch after this list for how to use them)
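Once a search has been run, the fitted object in the returned list exposes the usual GridSearchCV attributes. A minimal sketch, assuming the function and objects above are already defined:

# Run one search on all available workers and inspect the winning setting
finish_time, grid_search, grid_search_fit = grid_search_cv(model, grid, n_jobs=final_cpu, cv=cv)
print("Best max_features:", grid_search_fit.best_params_['max_features'])
print("Best mean accuracy: {:.3f}".format(grid_search_fit.best_score_))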
Performance comparison by the number of cores
We will now replicate the benchmarking steps we previously coded:
#------------------------------------------------------------------------------
# Benchmarking run time by depth of search and number of worker ants
#------------------------------------------------------------------------------
results_list = list()
# Create the cores list dynamically, as the previous example was hard coded
cores = [core for core in range(1, final_cpu+1)]

for n_cor in cores:
    search = grid_search_cv(model, grid=grid, n_jobs=n_cor, cv=cv)
    result = search[0]
    results_list.append(result)
This performs the same routine as before; the only difference is that it calls the grid search and runs it with a different number of worker ants each time:
Visualising the Grid Search results
I am not going to reimplement the function we created at the start of the tutorial, but I am going to use it for this example:
plot_results(cores, results_list)
This gives you the plot of the grid tuning process:
The tutorial code for this part can be found on the associated GitHub.
To end this parallel madness
If you have a use case you need help with just reach out. I would be interested to help.
I hope you enjoyed these two tutorials.