Hello, it is me again with another post on how to make scikit-learn perform at the top of its game.
Amping up the Model Evaluation process
Model evaluation in scikit-learn can be carried out with cross_val_score. This performs repeated stratified K fold resampling and assesses the model's accuracy across each of those folds, which gives a better sense of how the model will behave on unseen data than a simple hold-out split, as a hold-out is a one-off sample of the accuracy the model would achieve in the wild. Normally these resampled scores are averaged. However, I am coming at this from the angle of optimising the speed of the resamples.
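Before diving in, here is a minimal sketch (not part of the tutorial code, using a small toy dataset of my own choosing) that contrasts a one-off hold-out estimate with a cross-validated one:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, Y = make_classification(n_samples=500, n_features=10, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# One-off hold-out estimate - a single draw of the test data
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=42)
holdout_acc = clf.fit(X_tr, Y_tr).score(X_te, Y_te)

# Resampled estimate - averaged over 5 folds
cv_scores = cross_val_score(clf, X, Y, cv=5, scoring="accuracy")

print(f"Hold-out accuracy: {holdout_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")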
Bring in the necessary imports and available CPUs
This part of the code will look similar to the post on optimising model training:
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import os
import matplotlib.pyplot as plt

# Get CPU cores
cpu_count = os.cpu_count()
print(f"This machine has {cpu_count} cores")
reserved_cpu = 1
final_cpu = int(cpu_count - reserved_cpu)
print("Saving one CPU so my PC does not lock up")
Please refer to the previous post for what this is doing, as the CPU count logic is detailed there.
Create the dataset and fit a ML classifier
This is the same as before, albeit with a smaller dataset, because the repeated resampling multiplies the training time. Creating the dataset:
#------------------------------------------------------------------------------
# Create the dataset
#------------------------------------------------------------------------------
X, Y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5)
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
I will be fitting a Random Forest with 100 decision trees, and I do not enable parallelism at the model level at this stage (n_jobs=1).
Create the function for optimising the cross validation scoring
The function hereunder will be explained in depth under the code block:
#------------------------------------------------------------------------------
# Build function to optimise cv_score
#------------------------------------------------------------------------------
def cross_val_optimised(model, X, Y, workers):
    # Define the evaluation procedure
    strat_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
    # Start the timer
    start = time()
    model_result = cross_val_score(model, X, Y, scoring="accuracy", cv=strat_cv, n_jobs=workers)
    end = time()
    result = end - start
    print("[MODEL EVALUATION INFO] run time is {:.3f}".format(result))
    return [result, model_result]
- strat_cv – states that we want to undertake repeated stratified K fold resampling – n_splits is the K parameter, so I want 10 folds, and n_repeats says I want to repeat the whole 10-fold procedure 3 times, giving 30 resamples in total
- start – the current time, stored when the variable is assigned
- model_result – this is the main part of the function and the call we are going to run in parallel:
- model parameter – this is the model we fitted prior
- X – the independent variables dataset, or array
- Y – the dependent variable array
- cv – this is the strat_cv method
- n_jobs = workers – this will be what we iterate over to look at the performance of the models
- end – the end time
- result – how long the model ran in seconds. For long processes this might need to be converted to minutes/hours to provide more sensible feedback
- The function prints out the run time of the model
- The return statement outputs a list containing the result, i.e. how long the evaluation took to run, and model_result, which is the array of accuracy scores from the resamples (see the sketch after this list)
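As a quick illustration (my own addition, assuming the blocks above have been run), the resampler yields 10 x 3 = 30 folds and model_result holds the 30 accuracy scores, which you would typically summarise with their mean and standard deviation:

# Assumes model, X, Y and cross_val_optimised() from the blocks above
strat_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
print(strat_cv.get_n_splits(X, Y))  # 30 resamples in total

run_time, model_result = cross_val_optimised(model, X, Y, workers=1)
print("Mean accuracy: {:.3f} (+/- {:.3f})".format(model_result.mean(), model_result.std()))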
Benchmark our function with multiple workers (cores)
Finally, we are going to call the function on each pass through the cores. This time I have used a list comprehension to make this dynamic, as it produces a range of values from 1 up to the number of CPUs available on your machine:
#------------------------------------------------------------------------------
# Loop each call of the function and append the result to a list
#------------------------------------------------------------------------------
results_list = list()
# Create the cores list dynamically, as the previous example was hard coded
cores = [core for core in range(1, final_cpu+1)]

for n_cor in cores:
    cv_optim = cross_val_optimised(model, X, Y, workers=n_cor)
    result = cv_optim[0]
    results_list.append(result)
- results_list – an empty list initialised
- cores – list comprehension through a range of values from 1 to the number of CPUs we have on our machine
- The loop:
- creates an object called cv_optim – the result of calling our function on each iteration, with the number of workers increasing by one each time
- result – this is the run time from the list at index 0
- The final result gets appended to the empty list until the loop terminates
Running this code kicks off the benchmark and, for each worker count, the run time message from the function is printed to the console.
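If you would also like a compact summary once the loop finishes, a small sketch along these lines (my addition, not in the original script) pairs each core count with its recorded run time:

# Pair each core count with its recorded run time after the benchmarking loop
for n_cor, run_time in zip(cores, results_list):
    print(f"{n_cor} core(s): {run_time:.3f} seconds")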
Visualising the results
The last step is to work with matplotlib. This uses the function we created in the first tutorial, contained below for reference:
# Generate plot of results
def plot_results(x_val, y_val):
    plt.plot(x_val, y_val, color="blue", linestyle="--", marker="o")
    plt.xlabel("Number of cores")
    plt.ylabel("Run time (secs)")
    plt.show()
To use this function:
plot_results(cores, results_list)
This produces the plot illustrated hereunder:
With 7 workers performing the repeated cross-validation, the whole process completes much more quickly.
The code for this section can be found on the associated GitHub.
Tuning those hyperparameters faster
This part of the tutorial looks at hyperparameter tuning. We are going to tune the max_features hyperparameter of the random forest. This process, especially via custom grid searches, can be parallelised very effectively. Here is how to become a parallel whizz-kid!
Setup, dataset creation and model training
Here, we will use the same settings and classification created in the above tutorial on model evaluation.
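For reference, the recap below is lifted from the earlier blocks, so this part of the tutorial can be run on its own:

from time import time
import os
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
import matplotlib.pyplot as plt

# Keep one core back so the machine stays responsive
cpu_count = os.cpu_count()
reserved_cpu = 1
final_cpu = int(cpu_count - reserved_cpu)

# Same classification dataset as the model evaluation tutorial
X, Y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5)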
Setting up the Grid Search
This code sets up the grid search dictionary, as hyperparameter grids in scikit-learn are passed as dictionary data structures:
#------------------------------------------------------------------------------
# Train model on max number of CPUs
#------------------------------------------------------------------------------
model = RandomForestClassifier(n_estimators=100, n_jobs=final_cpu)

#------------------------------------------------------------------------------
# Grid search hyperparameter tuning
#------------------------------------------------------------------------------
# GridSearchCV is not in the earlier imports, so bring it in here
from sklearn.model_selection import GridSearchCV

# Do a K fold cross validation split
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
grid = dict()
grid['max_features'] = [1, 2, 3, 4, 5]
We create the cv variable that will be used for the resampling method. The grid is initialised as an empty dict() structure. I add a key called max_features to the grid, which controls how many features the random forest considers when looking for the best split at each node.
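As a sanity check (my own addition), you can work out how many model fits each grid search will trigger: five candidate values of max_features multiplied by the 30 resamples from the repeated stratified K fold:

# Number of candidate settings x number of resamples = fits per search
n_candidates = len(grid['max_features'])  # 5 values of max_features
n_resamples = cv.get_n_splits(X, Y)       # 10 splits x 3 repeats = 30
print(f"Each grid search fits the model {n_candidates * n_resamples} times (plus one final refit on the best setting)")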
Create the grid search function
The next step is to create our grid search function that will fit the model, perform tuning and time the whole process:
def grid_search_cv(model, grid, n_jobs, cv):
    print("[GRID SEARCH INFO] Starting the grid search")
    grid_search = GridSearchCV(model, grid, n_jobs=n_jobs, cv=cv)
    start = time()
    grid_search_fit = grid_search.fit(X, Y)
    end = time()
    finish_time = end - start
    print("[GRID SEARCH EXECUTION TIME] the search finished in: {:.3f} seconds.".format(finish_time))
    return [finish_time, grid_search, grid_search_fit]
To explain what the function is doing:
- the first step is to print out that the grid search has started
- grid_search – this is the driver of the search; it takes the model and the grid we created in the previous step to tune max_features in the random forest (the grid would change depending on the tuning routine)
- start – the time stamp taken just before the search begins
- grid_search_fit – fit the grid_search method to the features and the predicted variable
- end – the final time stamp at the end of the process
- finish_time – the elapsed time when running the routine
- prints out the time of execution to the console
- returns the finish_time in seconds, the grid_search object and the fitted grid_search_fit object, stored in a list structure (see the sketch after this list for how to use them)
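Once a search has been run, the fitted object in the returned list exposes the usual GridSearchCV attributes. A minimal sketch, assuming the function and objects above are already defined:

# Run one search on all available workers and inspect the winning setting
finish_time, grid_search, grid_search_fit = grid_search_cv(model, grid, n_jobs=final_cpu, cv=cv)
print("Best max_features:", grid_search_fit.best_params_['max_features'])
print("Best mean accuracy: {:.3f}".format(grid_search_fit.best_score_))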
Performance comparison by the number of cores
We will now replicate the benchmarking steps we previously coded:
#------------------------------------------------------------------------------
# Benchmarking run time by depth of search and number of worker ants
#------------------------------------------------------------------------------
results_list = list()
# Create the cores list dynamically, as the previous example was hard coded
cores = [core for core in range(1, final_cpu+1)]

for n_cor in cores:
    search = grid_search_cv(model, grid=grid, n_jobs=n_cor, cv=cv)
    result = search[0]
    results_list.append(result)
This performs the same routine as before; the only difference is that it calls the grid search and runs it with a different number of worker ants each time:
Visualising the Grid Search results
I am not going to reimplement the function we created at the start of the tutorial, but I am going to use it for this example:
plot_results(cores, results_list)
This gives you the plot of the grid tuning process:
The tutorial code for this part can be found on the associated GitHub.
To end this parallel madness
If you have a use case you need help with just reach out. I would be interested to help.
I hope you enjoyed these two tutorials.