Precision, Recall & the F-1 Score
A big part of assessing the performance of a classification model is analyzing its evaluation metrics. In contrast to linear regression, where error is measured by how far estimates fall from actuals, with classification modeling a prediction is either correct or incorrect. Because of this distinction, it's critical to look at several evaluation metrics when evaluating your model.
In this post, we'll touch upon the four major evaluation metrics but focus mainly on Precision and Recall. We'll dive into what these metrics calculate, how they influence other important metrics, and the circumstances in which one can be more important than the other.
There are a variety of evaluation metrics used to assess a classification model, such as accuracy, precision, recall, and F-1 score. The following definitions are built around the terminology of a confusion matrix.
- Accuracy — arguably the most intuitive metric: out of all the observations, how many did our model predict correctly (true positives plus true negatives)? This is very informative but should not be used alone when evaluating a model. With a serious class imbalance (for example, 1,000 observations of which 999 are negative and 1 is positive), a model that always predicts the majority class would rate as 99.9% accurate even though it never identifies the one positive case. This could be detrimental if, for example, our model were predicting a rare, deadly disease.
- Precision — also referred to as the positive predictive value, calculates how well the model identifies true positives (TP) out of the number of predicted positives (TP + FP).
- Recall — also known as Sensitivity, calculates how well the model identifies true positives (TP) out of the actual number of total positives (TP + FN).
- F-1 Score — this metric represents the harmonic mean of Precision and Recall: F-1 = 2 × (Precision × Recall) / (Precision + Recall). A model only earns a high F-1 score when both Precision and Recall are high, so a high F-1 score shows the model is doing well all around.
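The definitions above can be sketched as a few small Python functions. This is a minimal illustration using made-up confusion-matrix counts for the class-imbalance example mentioned earlier (a rare-disease model that always predicts "negative"); the numbers are not from a real dataset.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when no positives were predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# The imbalance example from above: 1,000 observations, 999 negative,
# 1 positive, and a model that always predicts the majority class.
tp, fp, tn, fn = 0, 0, 999, 1
print(accuracy(tp, fp, tn, fn))  # 0.999 -- looks great
print(recall(tp, fn))            # 0.0   -- misses the one case that matters
```

Notice how accuracy alone paints a flattering picture, while recall immediately exposes that the model never finds the positive case.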
Which Metric Is Best?
This question depends entirely on the business use case or the goals for the model, which is why it's so important to understand what each of these metrics calculates.
Typically, it's best practice to calculate all relevant metrics, which allows you to compare how different classifiers perform against each other. That said, the following are examples of when you might choose to focus more on Precision or Recall, depending on how false positives and false negatives affect your outcomes.
A traditional example of when Recall is more important is cancer diagnosis. This is a case where false positives are far more acceptable than false negatives. It's OK if you diagnose someone with cancer when they are healthy (false positive), since further testing will show they are healthy. It's far more disastrous to diagnose someone as healthy when they do in fact have cancer (false negative).
Spam email detection is a case where Precision is more important than Recall, that is, where false negatives are more acceptable than false positives. Finding the occasional spam email in your inbox (false negative) is not a big deal, but an important email being classified as spam (false positive) can become a big issue.
Having high values of both Precision and Recall is very difficult in practice: increasing one tends to decrease the other, so you will often need to determine which is more important for your business case. Hence the importance of understanding what each of these metrics actually calculates and how it relates to your model.
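One common way this trade-off shows up is in the choice of decision threshold for a classifier that outputs probability scores. Here is a minimal sketch with hypothetical, hand-made scores and labels (not real model output): lowering the threshold raises Recall but drags Precision down.

```python
# Hypothetical predicted scores and true labels (1 = actual positive).
scores = [0.95, 0.90, 0.85, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def precision_recall(threshold):
    """Compute (precision, recall) when predicting positive at >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

# Sweeping the threshold down trades precision away for recall.
for t in (0.8, 0.5, 0.25):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, the threshold of 0.8 gives the highest precision, while 0.25 catches every positive (recall of 1.0) at the cost of more false positives, which is exactly the tension described above.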
I hope this short post helps clarify the following:
- How to calculate Accuracy, Precision, Recall, and F-1
- What Accuracy, Precision, Recall and F-1 are calculating
- How Precision and Recall impact a model’s F-1 Score
- When it's better to focus on Recall vs. Precision
This is only the tip of the iceberg when it comes to model evaluation; a good next step is to start evaluating ROC and Precision-Recall curves.