Top 3 Classification Machine Learning Metrics – Ditch Accuracy Once and For All

Accuracy can be misleading, especially with class-imbalanced datasets. That’s why you should replace it with a more robust metric. Today you’ll learn three of them – and implement them from scratch.

Here’s what you’ll learn today:

  • Why accuracy sucks
  • Confusion matrix crash course
  • Precision
  • Recall
  • F-Beta measure

Why accuracy sucks

Let’s say you’ve evaluated your models only with accuracy so far. You know that top-left and bottom-right values in a confusion matrix should be high, and the other two should be low. 

But what do these numbers mean? What’s wrong with good ol’ accuracy?

Just imagine you’re trying to classify terrorists from face images. Let’s say that out of 1,000,000 people, 10 are terrorists. If you were to build a dummy model that classifies every image as non-terrorist, you would have a 99.999% accurate model!

Please don’t put “I’ve built SOTA terrorist classification models” on your resume just yet. Accuracy can be misleading.

The goal of your model should be to correctly classify terrorists every time. And it’s managed to do so exactly 0 times. 

In other words – the recall value is zero. And your model sucks.
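Here’s a minimal sketch of that scenario, using scikit-learn and made-up labels that match the counts from the example above (not any real dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000,000 people, 10 of them are terrorists (label 1)
y_true = np.zeros(1_000_000, dtype=int)
y_true[:10] = 1

# Dummy model: classify everyone as a non-terrorist (label 0)
y_pred = np.zeros(1_000_000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99999 - looks amazing
print(recall_score(y_true, y_pred))    # 0.0 - never catches a single terrorist
```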

Confusion matrix crash course

Before diving deep into these metrics, let’s do a quick refresher on the confusion matrix. Here’s what it generally looks like:

Image 1 – Confusion matrix (image by author)

Let’s make it a bit less abstract. I’ve gone and trained a wine classifier model and obtained a confusion matrix. Here’s what it looks like:

Image 2 – Confusion matrix with real data (image by author)

Is this good? Who knows.

Accuracy is around 88%, but that doesn’t necessarily mean anything. That’s where precision, recall, and F-beta metrics come into play.
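The wine model itself isn’t reproduced here, but the snippet below shows how you could pull the individual TP/FP/FN/TN counts out of a confusion matrix with scikit-learn, using hypothetical labels rather than the article’s actual data:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels (1 = good wine, 0 = bad wine) - not the real dataset
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]

# For binary labels, rows are the actual class and columns are the predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Accuracy alone hides how the errors split between false positives and false negatives
print(f"Accuracy: {(tp + tn) / (tp + tn + fp + fn):.2f}")
```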

Precision

In the simplest terms, precision tells you what fraction of your positive predictions were actually correct. It is calculated as the number of true positives divided by the sum of true positives and false positives:

Image 3 – Precision formula (image by author)

Still a bit confusing? Continue reading.

You know what a true positive is – an instance that was actually positive, and the model classified it as positive (good wine classified as a good wine). But what are false positives? Put simply, an instance that’s negative but classified as positive (bad wine classified as good).

Here’s a more alarming example of a false positive: a patient doesn’t have cancer, but the doctor says he does.

Back to the wine example. You can calculate the precision score from the formula mentioned above. Here’s a complete walkthrough:

Image 4 – Precision calculation (image by author)

So, around 0.84. Both precision and recall range from 0 to 1 (higher is better), so this value seems to be pretty good.

In other words – your model doesn’t produce a lot of false positives.
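Since the goal is also to implement these metrics from scratch, here’s a minimal precision function. The TP and FP counts below are made-up values chosen to roughly match the ~0.84 result from the walkthrough, not the exact numbers from the image:

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP) - the fraction of positive predictions that were correct."""
    if tp + fp == 0:
        return 0.0  # no positive predictions were made at all
    return tp / (tp + fp)

# Hypothetical counts, chosen so the result lands near 0.84
print(precision(tp=108, fp=21))  # 0.837...
```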

You now know what precision is, but what the heck is recall? Let’s demystify that next.

Recall

Recall might be the most useful metric for many classification problems. It tells you how many of the actual positives your model managed to identify. It is calculated as the number of true positives divided by the sum of true positives and false negatives:

Image 5 – Recall formula (image by author)

If you’re even remotely like me, there’s a chance you’ll find the above definition a bit abstract.

Here’s how to apply it to classifying wines: Out of all good wines, how many did you classify correctly?

This is where you need to know what false negatives are. A false negative is a positive instance classified as negative. Sure, it’s all fun and games when classifying wines, but what about a more serious scenario?

In our earlier medical example, false negative means the following: a patient has cancer, but the doctor says he doesn’t.

As you can see, false negatives can sometimes be more costly than false positives. It’s essential to recognize which one is more important for your problem.

Back to the wine example. You can calculate the recall score from the formula mentioned above. Here’s a complete walkthrough:

Image 6 – Recall calculation (image by author)

Just like precision, recall also ranges from 0 to 1 (higher is better). A recall of 0.61 isn’t that great.

In other words – your model produces a decent amount of false negatives.
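And here’s the matching from-scratch recall function. Again, the TP and FN counts are hypothetical values picked to land near the ~0.61 result from the walkthrough:

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN) - the fraction of actual positives the model caught."""
    if tp + fn == 0:
        return 0.0  # there were no actual positives in the data
    return tp / (tp + fn)

# Hypothetical counts, chosen so the result lands near 0.61
print(recall(tp=108, fn=69))  # 0.610...
```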

But what if you want both precision and recall to be somewhat decent? Then you’ll fall in love with the F-Beta metric.

F-Beta measure

The F-measure provides you with a balance between precision and recall. The default F-measure is F1, which doesn’t favor either of the two previously discussed metrics.

Here’s the formula for calculating the F1 score:

Image 7 – F1-measure formula (image by author)

As you can see, to calculate F1 you need to know the values for precision and recall beforehand. Here’s the full calculation walkthrough for our example:

Image 8 – F1 calculation (image by author)
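In code, the F1 score is just the harmonic mean of the two metrics. The sketch below plugs in the approximate precision and recall values from the walkthroughs above:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = 2 * precision * recall / (precision + recall) - the harmonic mean of the two."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Approximate values from the earlier walkthroughs
print(f1_score(precision=0.84, recall=0.61))  # ≈ 0.71
```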

But what’s the deal with the beta parameter?

During the F-score calculation, you can emphasize either recall or precision by altering the beta parameter. Here’s what the more general formula for calculating F-scores looks like:

Image 9 – F-beta measure formula (image by author)

If beta is 1, then you’re calculating the F1 score and can simplify the formula to the one seen earlier in this section.

Here’s a general rule of thumb for selecting the value of beta:

  • Beta = 0.5 (F0.5-measure): You want a balance between precision and recall, with more weight on precision
  • Beta = 1 (F1-measure): You want a pure balance between precision and recall
  • Beta = 2 (F2-measure): You want a balance between precision and recall, with more weight on recall

To simplify, you can calculate F0.5-measure with the following formula:

Image 10 – F0.5 measure formula (image by author)

And F2-measure with this one:

Image 11 – F2 measure formula (image by author)

These beta values aren’t set in stone, so feel free to experiment, depending on the problem you’re solving.
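To tie it all together, here’s a minimal from-scratch F-beta implementation that covers all three rules of thumb above. The precision and recall values are the approximate ones from earlier:

```python
def fbeta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta < 1 puts more weight on precision, beta > 1 puts more weight on recall.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

p, r = 0.84, 0.61
print(fbeta_score(p, r, beta=0.5))  # F0.5 - favors precision
print(fbeta_score(p, r, beta=1))    # F1   - balanced
print(fbeta_score(p, r, beta=2))    # F2   - favors recall
```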

Conclusion

In a nutshell – accuracy can be misleading, so be careful when using it. If predicting positives and negatives is equally important and the classes are roughly balanced, accuracy can still be useful.

That’s not the case most of the time. Take your time to study the dataset and the problem, and decide what’s more important to you – fewer false positives or fewer false negatives.

From that point, metric selection is trivial.

Which metric(s) do you use for classification problems? Let me know in the comment section.

Join my private email list for more helpful insights.

 
