When building an ML model, you must designate an evaluation metric, which tells the algorithm what you’re optimizing for. One commonly used evaluation metric is accuracy, that is, the percentage of your data for which the model makes the correct prediction. This may seem like a great choice: who would want a model that isn’t the most accurate?
Actually, there are many cases where you wouldn’t want to optimize for accuracy, the most prevalent being when your data has imbalanced classes. Say you’re building a spam filter to classify emails as spam or not, and only 1% of emails are actually spam (this is what is meant by imbalanced classes: 1% of the data is spam, 99% is not). Then a model that classifies all emails as non-spam has an accuracy of 99%, which sounds great, but the model is useless: it never catches a single spam email.
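To make the arithmetic concrete, here is a minimal Python sketch of the spam example. The dataset is made up for illustration (1,000 emails, 10 of them spam), and the `accuracy` helper is a hypothetical function written for this sketch, not part of any library.

```python
# Hypothetical dataset: 1,000 emails, 1% spam (label 1), 99% non-spam (label 0).
labels = [1] * 10 + [0] * 990

# A "model" that classifies every email as non-spam.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


acc = accuracy(labels, predictions)
spam_caught = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 1)

print(f"accuracy: {acc:.2%}")                           # accuracy: 99.00%
print(f"spam caught: {spam_caught} of {sum(labels)}")   # spam caught: 0 of 10
```

The accuracy figure looks excellent even though the model never flags a single spam email, which is exactly why accuracy alone can mislead on imbalanced data.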
There are alternative metrics, such as precision, recall, and the F1 score, that are less misleading when classes are imbalanced. It is key that you speak with your data scientists about what they’re optimizing for and how it relates to your business question. A good place to start these discussions is not by focusing on a single metric but by looking at what’s called the confusion matrix of the model, which contains the following four counts:
- False negatives (e.g., real spam incorrectly classified as non-spam)
- False positives (non-spam incorrectly classified as spam)
- True negatives (non-spam correctly classified)
- True positives (spam correctly classified)
(Figure: confusion matrix. Source: Glass Box Medicine)
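The four counts above can be tallied directly from true labels and predictions. The following is a rough sketch in plain Python, assuming a made-up spam filter that catches 8 of 10 spam emails and wrongly flags 5 legitimate ones; the helper names (`confusion_counts`, `precision`, `recall`) are hypothetical and chosen for this illustration only. Precision and recall are two common metrics derived from these counts.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four confusion-matrix cells, treating `positive` as spam."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1   # spam correctly classified
            else:
                fp += 1   # non-spam incorrectly classified as spam
        else:
            if t == positive:
                fn += 1   # spam incorrectly classified as non-spam
            else:
                tn += 1   # non-spam correctly classified
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def precision(c):
    """Of the emails flagged as spam, the fraction that really were spam."""
    flagged = c["tp"] + c["fp"]
    return c["tp"] / flagged if flagged else 0.0


def recall(c):
    """Of the emails that really were spam, the fraction that got flagged."""
    actual = c["tp"] + c["fn"]
    return c["tp"] / actual if actual else 0.0


# Hypothetical results on 1,000 emails, 10 of which are spam.
y_true = [1] * 10 + [0] * 990
y_pred = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 985

counts = confusion_counts(y_true, y_pred)
print(counts)                                   # {'tp': 8, 'fp': 5, 'tn': 985, 'fn': 2}
print(f"precision: {precision(counts):.3f}")    # precision: 0.615
print(f"recall:    {recall(counts):.3f}")       # recall:    0.800
```

In this made-up case accuracy would be 99.3%, barely different from the do-nothing model, while precision and recall show how much spam is caught and how many legitimate emails are flagged along the way.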
A lot of attention is currently focused on the importance of the data you feed your ML models and how it relates to your evaluation metric. YouTube had to learn this the hard way: when it optimized for revenue based on view time (how long people stay glued to videos), its recommendation system began surfacing more violent and incendiary content, along with more conspiracy videos and fake news.
An interesting lesson here is that optimizing for revenue may not be aligned with other goals, such as showing truthful content. View time stands in for revenue because it is correlated with the number of ads YouTube can serve you. This is an algorithmic version of Goodhart’s Law, which states: “When a measure becomes a target, it ceases to be a good measure.”
The best-known example is a Soviet nail factory whose workers were first given a target for the number of nails and so produced many small nails. To counter this, the target was changed to the total weight of the nails, so they then made a few giant nails. But algorithms also fall prey to Goodhart’s Law, as we’ve seen with the YouTube recommendation system.