You’d be surprised at how many data scientists don’t know how to turn their probabilities into class labels. Often they just go with 50% as the cutoff without a second thought, regardless of any class imbalance or asymmetric costs in the outcomes. I’ve even spoken to data scientists in the healthcare industry whose predictions of events such as “has disease” versus “does not have disease” came without utility-based thresholds. For some disease predictions it might not matter much, but for others, how can you possibly use an arbitrary threshold knowing that a false positive affects a patient very differently from a false negative? For situations like this, you should be applying either the threshOptim() function or the RedYellowGreen() function from the RemixAutoML package in R.
What threshold should you use to classify your predicted probabilities?
Okay, so you spent time building out an awesome classification model. You are seeing a great AUC compared to previous versions. Now what? Your product manager asks you what threshold to use for classifying your predicted probabilities. How do you answer that?
You should know how to answer this question. There are several methods you can use. H2O, for example, offers several threshold-selection criteria that are useful to know:

 max f1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall (sensitivity)
 max f2 = the F2 measure weights recall higher than precision (i.e. it penalizes a large number of false negatives)
 max f0point5 = the F_{0.5} measure puts more emphasis on precision than on recall (i.e. it penalizes a large number of false positives)
 max accuracy = (TP + TN) / (P + N), how many cases across both P and N you identify correctly
 max precision = TP / (TP + FP), how many of the cases you predict as positive are actually positive
 max recall = TP / P = TP / (TP + FN), how many of the positives you identify correctly
 max specificity = TN / N = TN / (FP + TN), how many of the negatives you identify correctly
 max absolute_MCC = a balanced measure that can be used even if the classes are of very different sizes. The Matthews correlation coefficient (MCC) takes a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and −1 indicates total disagreement between prediction and observation. While there is no perfect way of describing the confusion matrix of true and false positives and negatives with a single number, the MCC is generally regarded as one of the best such measures.
 max min_per_class_accuracy = the threshold that maximizes the lower of the two per-class accuracies (recall and specificity), so that the worse-performing class does as well as possible
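To make the list above concrete, here is a minimal Python sketch that scans a grid of thresholds and picks the one that maximizes a chosen metric. It is not H2O's implementation. The helper names (`confusion_counts`, `f_beta`, `best_threshold`), the toy data, and the 0.01 grid are all assumptions made for illustration.

```python
# Toy labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.10, 0.20, 0.25, 0.30, 0.40, 0.45, 0.55, 0.65, 0.80, 0.90]


def confusion_counts(y_true, y_prob, threshold):
    """Return (TP, FP, TN, FN), labeling prob >= threshold as positive."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f_beta(tp, fp, tn, fn, beta=1.0):
    """F-beta score: beta=1 is F1, beta=2 is F2, beta=0.5 is F0.5."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def min_per_class_accuracy(tp, fp, tn, fn):
    """The lower of recall and specificity."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return min(recall, specificity)


def best_threshold(y_true, y_prob, metric):
    """Grid-search thresholds 0.01..0.99; ties go to the lowest threshold."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: metric(*confusion_counts(y_true, y_prob, t)))


t_f1 = best_threshold(y_true, y_prob, f_beta)
t_f2 = best_threshold(y_true, y_prob, lambda *c: f_beta(*c, beta=2.0))
print("max f1 threshold:", t_f1)
print("max f2 threshold:", t_f2)
```

Because F2 weights recall more heavily, it tends to pick a lower threshold than F1 on the same data, which is the behavior the list describes.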
The Correct Thresholds to Generate
Okay, those sound technical, but which one do you use to optimize asymmetric costs and profits for correct predictions and Type 1 and Type 2 errors? Let’s say that the payoff matrix looks like the one below. H2O defaults to max f1, which will typically be sufficient for most cases, but it also offers F2 for penalizing a large number of false negatives and f0point5 for penalizing a large number of false positives. Those measures get you closer to where you want to be, but why not be precise and optimize the threshold directly?
If your payoff matrix looks something like the table below, so that correct predictions are not all worth 1 and incorrect predictions are not all worth 0 (the default values), then you should be using the threshOptim() and RedYellowGreen() functions in the RemixAutoML package for R.
Actual \ Predicted    Positive Prediction    Negative Prediction
Positive Outcome      0.0                    15
Negative Outcome      4.0                    0.0
The threshOptim() function uses the costs in the confusion matrix to determine the single threshold that maximizes utility. For cases where uncertain probability predictions warrant further analysis by a medical professional, you should use the RedYellowGreen() function. It is designed to let you plug in not only the costs of a false positive and a false negative but also the cost of further analysis, and it returns two thresholds. Any predicted probability that falls between the two thresholds should be sent for review, while predicted probabilities below the lower threshold are treated as clear negative outcomes and those above the upper threshold as clear positive outcomes.
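The idea behind a single utility-optimal threshold can be sketched in a few lines of Python. This is not the RemixAutoML implementation. It assumes the entries of the payoff table above are costs (4 per false positive, 15 per false negative, 0 for correct predictions), so maximizing utility is the same as minimizing total cost. The function names, the toy data, and the 0.01 grid are hypothetical.

```python
# Toy labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.10, 0.20, 0.25, 0.30, 0.40, 0.45, 0.55, 0.65, 0.80, 0.90]


def total_cost(y_true, y_prob, threshold, fp_cost=4.0, fn_cost=15.0):
    """Sum the misclassification costs when prob >= threshold means positive.

    Assumption: correct predictions cost nothing, as in the table above.
    """
    cost = 0.0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 0:
            cost += fp_cost  # false positive
        elif predicted == 0 and actual == 1:
            cost += fn_cost  # false negative
    return cost


def cost_optimal_threshold(y_true, y_prob, fp_cost=4.0, fn_cost=15.0):
    """Grid-search thresholds 0.01..0.99; ties go to the lowest threshold."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda t: total_cost(y_true, y_prob, t, fp_cost, fn_cost))


# With equal costs the search settles on a mid-range cutoff; with the
# asymmetric costs it moves lower so that fewer positives are missed.
t_symmetric = cost_optimal_threshold(y_true, y_prob, fp_cost=1.0, fn_cost=1.0)
t_asymmetric = cost_optimal_threshold(y_true, y_prob, fp_cost=4.0, fn_cost=15.0)
print("symmetric costs: ", t_symmetric)
print("asymmetric costs:", t_asymmetric)
```

On this toy data the asymmetric costs pull the cutoff well below the symmetric one, which is exactly why an arbitrary 50% threshold can be expensive when false negatives cost more than false positives.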
Below is a sample plot output in R from RemixAutoML::RedYellowGreen(), which is generated automatically when you run the function. The lower threshold is 0.32 and the upper threshold is 0.34. If you generate a predicted probability of 0.33, you would send that instance for further review.
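Here is a rough Python sketch of how two thresholds translate into a three-way decision, plus a brute-force search for the pair of thresholds. It is not the RedYellowGreen() implementation. It assumes that a reviewed case is always resolved correctly for a fixed review cost, that probabilities exactly on a threshold go to review, and that the table entries are costs. All names, the toy data, and the review cost of 1 are hypothetical.

```python
# Toy labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.10, 0.20, 0.25, 0.30, 0.40, 0.45, 0.55, 0.65, 0.80, 0.90]


def route_prediction(prob, lower=0.32, upper=0.34):
    """Map a probability to 'negative', 'review', or 'positive'."""
    if prob < lower:
        return "negative"
    if prob > upper:
        return "positive"
    return "review"  # on or between the two thresholds


def band_cost(y_true, y_prob, lower, upper, review_cost,
              fp_cost=4.0, fn_cost=15.0):
    """Total cost of a (lower, upper) pair under the assumptions above."""
    cost = 0.0
    for actual, prob in zip(y_true, y_prob):
        decision = route_prediction(prob, lower, upper)
        if decision == "review":
            cost += review_cost  # assumed to be resolved correctly
        elif decision == "positive" and actual == 0:
            cost += fp_cost
        elif decision == "negative" and actual == 1:
            cost += fn_cost
    return cost


def best_band(y_true, y_prob, review_cost, fp_cost=4.0, fn_cost=15.0):
    """Search all grid pairs with lower <= upper for the cheapest band."""
    grid = [i / 100 for i in range(1, 100)]
    pairs = [(lo, hi) for lo in grid for hi in grid if lo <= hi]
    return min(pairs, key=lambda p: band_cost(
        y_true, y_prob, p[0], p[1], review_cost, fp_cost, fn_cost))


print(route_prediction(0.33))  # falls between 0.32 and 0.34
lower, upper = best_band(y_true, y_prob, review_cost=1.0)
print("review band:", lower, upper)
```

The cheaper a review is relative to the two error costs, the wider the band the search will choose. As the review cost rises, the band narrows toward a single threshold.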
Utility calculation for threshOptim()
Utility calculation for RedYellowGreen()