3 Essential Ways to Calculate Feature Importance in Python

[This article was first published on python – Better Data Science, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

How can you find the most important features in your dataset? There’s a ton of techniques, and this article will teach you three any data scientist should know.

After reading, you’ll know how to calculate feature importance in Python with only a couple of lines of code. You’ll also learn the prerequisites of these techniques – crucial to making them work properly.

You can download the Notebook for this article here.

The article is structured as follows:

Dataset loading and preparation

Let’s spend as little time as possible here. You’ll use the Breast cancer dataset, which is built into Scikit-Learn. You’ll also need Numpy, Pandas, and Matplotlib for various analysis and visualization purposes. 

The following snippet shows you how to import the libraries and load the dataset:

The dataset isn’t in the most convenient format now. You’ll work with Pandas data frames most of the time, so let’s quickly convert it into one. The following snippet concatenates predictors and the target variable into a single data frame:

Calling head() results in the following output:

Image 1 - Head of Breast cancer dataset (image by author)

Image 1 – Head of Breast cancer dataset (image by author)

In a nutshell, there are 30 predictors and a single target variable. All of the values are numeric, and there are no missing values. The only obvious problem is the scale. Just take a look at the mean area and mean smoothness columns – the differences are drastic, which could result in poor models.

You’ll also need to perform a train/test split before addressing the scaling issue.

The following snippet shows you how to make a train/test split and scale the predictors with the StandardScaler class:

And that’s all you need to start obtaining feature importances. Let’s do that next.

Method #1 – Obtain importances from coefficients

Probably the easiest way to examine feature importances is by examining the model’s coefficients. For example, both linear and logistic regression boils down to an equation in which coefficients (importances) are assigned to each input value. 

Put simply, if an assigned coefficient is a large (negative or positive) number, it has some influence on the prediction. On the contrary, if the coefficient is zero, it doesn’t have any impact on the prediction.

Simple logic, but let’s put it to the test. We have a classification dataset, so logistic regression is an appropriate algorithm. After the model is fitted, the coefficients are stored in the coef_ property.

The following snippet trains the logistic regression model, creates a data frame in which the attributes are stored with their respective coefficients, and sorts that data frame by the coefficient in descending order:

That was easy, wasn’t it? Let’s examine the coefficients visually next. The following snippet makes a bar chart from coefficients:

Here’s the corresponding visualization:

Image 2 - Feature importances as logistic regression coefficients (image by author)

Image 2 – Feature importances as logistic regression coefficients (image by author)

And that’s all there is to this simple technique. A take-home point is that the larger the coefficient is (in both positive and negative direction), the more influence it has on a prediction.

Method #2 – Obtain importances from a tree-based model

After training any tree-based models, you’ll have access to the feature_importances_ property. It’s one of the fastest ways you can obtain feature importances.

The following snippet shows you how to import and fit the XGBClassifier model on the training data. The importances are obtained similarly as before – stored to a data frame which is then sorted by the importance:

You can examine the importance visually by plotting a bar chart. Here’s how to make one:

The corresponding visualization is shown below:

Image 3 - Feature importances obtained from a tree-based model (image by author)

Image 3 – Feature importances obtained from a tree-based model (image by author)

As mentioned earlier, obtaining importances in this way is effortless, but the results can come up a bit biased. The tendency of this approach is to inflate the importance of continuous features or high-cardinality categorical variables[1]. Make sure to do the proper preparation and transformations first, and you should be good to go.

Method #3 – Obtain importances from PCA loading scores

Principal Component Analysis (PCA) is a fantastic technique for dimensionality reduction, and can also be used to determine feature importance.

PCA won’t show you the most important features directly, as the previous two techniques did. Instead, it will return N principal components, where N equals the number of original features. 

If you’re a bit rusty on PCA, there’s a complete from-scratch guide at the end of this article.

To start, let’s fit PCA to our scaled data and see what happens. The following snippet does just that and also plots a line plot of the cumulative explained variance:

Here’s the corresponding visualization:

Image 4 - PCA cumulative explained variance (image by author)

Image 4 – PCA cumulative explained variance (image by author)

But what does this mean? It means you can explain 90-ish% of the variance in your source dataset with the first five principal components. Again, refer to the from-scratch guide if you don’t know what this means.

You can now start dealing with PCA loadings. These are just coefficients of the linear combination of the original variables from which the principal components are constructed[2]. You can use loadings to find correlations between actual variables and principal components.

If there’s a strong correlation between the principal component and the original variable, it means this feature is important – to say with the simplest words.

Here’s the snippet for computing loading scores with Python:

The corresponding data frame looks like this:

Image 5 - Head of PCA loading scores (image by author)

Image 5 – Head of PCA loading scores (image by author)

The first principal component is crucial. It’s just a single feature, but it explains over 60% of the variance in the dataset. As you can see from Image 5, the correlation coefficient between it and the mean radius feature is almost 0.8 – which is considered a strong positive correlation.

Let’s visualize the correlations between all of the input features and the first principal components. Here’s the entire code snippet (visualization included):

The corresponding visualization is shown below:

Image 6 - PCA loading scores from the first principal component (image by author)

Image 6 – PCA loading scores from the first principal component (image by author)

And that’s how you can “hack” PCA to use it as a feature importance algorithm. Let’s wrap things up in the next section.

Conclusion

And there you have it – three techniques you can use to find out what matters. Of course, there are many others, and you can find some of them in the Learn more section of this article.

These three should suit you well for any machine learning task. Just make sure to do the proper cleaning, exploration, and preparation first.

Thanks for reading.

Join my private email list for more helpful insights.

Learn more

References

 

The post 3 Essential Ways to Calculate Feature Importance in Python appeared first on Better Data Science.

To leave a comment for the author, please follow the link and comment on their blog: python – Better Data Science.

Want to share your content on python-bloggers? click here.