ICE and partial dependence plots, as used with LIME, fail to tell me how accurate the fitted relationship actually is. Further, ICE doesn't tell me how likely each of the individual lines is to occur. Your model can overfit or underfit your data pretty easily, especially if you are using deep learning models, and LIME (should be called LAME) fails to tell me how the model actually performs.
Imagine you are working on a price elasticity model that will guide pricing decisions. Typically, you would show the relationship that the model was able to fit. Given that the model will guide pricing decisions, a sensible stakeholder might ask, "I see the relationship that your model fit, but how do I know it corresponds to the actual relationship?"
What do you do? Give the stakeholder some model accuracy metrics? Tell them that you used deep learning so they should just trust it because it is state-of-the-art technology?
Here is a simple solution to the shortfall of partial dependence plots: use calibration on your predicted relationship. It's that simple. Below is an example plot from the RemixAutoML package in R. The x-axis is the independent variable of interest, and the spacing between ticks is based on percentiles of that variable's distribution. That means the data is uniformly distributed across the x-axis, so there is no need for the rug dashes shown in the ICE chart above. We can still see the relationship between the independent variable and the target variable, as with partial dependence plots, but we can also see how good a fit the model achieves across the entire range of the independent variable. This addresses the skepticism from your stakeholders about the accuracy of your predictions. If you want to see the variability of your predictions, use the boxplot version too. If you want to see the relationship for a specific group, simply subset your data so only that group of interest is included, and rerun the function.
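To make the mechanics concrete before the R code, here is a minimal sketch of what a partial dependence calibration table computes: bucket the validation data by percentiles of the independent variable, then compare the mean actual value against the mean predicted value within each bucket. This is my own illustration in Python, not RemixAutoML code; the function name and mock data are invented for the example.

```python
import numpy as np

def pardep_calibration(x, actual, predicted, bucket=0.05):
    """Bucket rows by percentiles of the independent variable x, then
    compare mean actual vs. mean predicted values within each bucket.
    Close agreement across buckets signals a well-calibrated fit."""
    qs = np.linspace(0.0, 1.0, int(round(1.0 / bucket)) + 1)
    edges = np.quantile(x, qs)
    rows = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Make the final bucket inclusive of the maximum value
        mask = (x >= lo) & ((x <= hi) if i == len(edges) - 2 else (x < hi))
        if mask.any():
            rows.append((lo, hi, actual[mask].mean(), predicted[mask].mean()))
    return rows

# Mock validation data: a model that recovered the true relationship
rng = np.random.default_rng(0)
x = rng.uniform(size=1000)
actual = x**2 + rng.normal(scale=0.05, size=1000)
predicted = x**2
table = pardep_calibration(x, actual, predicted, bucket=0.20)
for lo, hi, a, p in table:
    print(f"x in [{lo:.2f}, {hi:.2f}]  actual={a:.3f}  predicted={p:.3f}")
```

Plotting the actual and predicted columns against the bucket midpoints gives exactly the calibration view described above: wherever the two curves diverge, the model fits poorly in that region of the independent variable.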
#######################################################
# Create data to simulate validation data with predicted values
#######################################################

# Correl: This is the correlation used to determine how correlated the
# variables are to the target variable. Switch it up (between 0 and 1)
# to see how the charts below change.
Correl <- 0.85
data <- data.table::data.table(Target = runif(1000))

# Mock independent variables - they are correlated variables with various
# transformations so you can see different kinds of relationships in the
# charts below

# Helper columns for creating simulated variables
data[, x1 := qnorm(Target)]
data[, x2 := runif(1000)]

# Create one variable at a time
data[, Independent_Variable1 := log(pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))]
data[, Independent_Variable2 := pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2))]
data[, Independent_Variable3 := exp(pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))]
data[, Independent_Variable4 := exp(exp(pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2))))]
data[, Independent_Variable5 := sqrt(pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))]
data[, Independent_Variable6 := (pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))^0.10]
data[, Independent_Variable7 := (pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))^0.25]
data[, Independent_Variable8 := (pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))^0.75]
data[, Independent_Variable9 := (pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))^2]
data[, Independent_Variable10 := (pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2)))^4]
data[, Independent_Variable11 := ifelse(Independent_Variable2 < 0.20, "A",
                                 ifelse(Independent_Variable2 < 0.40, "B",
                                 ifelse(Independent_Variable2 < 0.60, "C",
                                 ifelse(Independent_Variable2 < 0.80, "D", "E"))))]

# We'll use this as a mock predicted value
data[, Predict := pnorm(Correl * x1 + sqrt(1 - Correl^2) * qnorm(x2))]

# Remove the helper columns
data[, ':=' (x1 = NULL, x2 = NULL)]

# In the ParDepCalPlots() function below, note the Function argument -
# we are using mean() to aggregate our values but you can use
# quantile(x, probs = y) for quantile regression

# Partial Dependence Calibration Plot:
p1 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable1",
                                  GraphType = "calibration",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Partial Dependence Calibration BoxPlot: note the GraphType argument
p2 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable1",
                                  GraphType = "boxplot",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Partial Dependence Calibration Plot:
p3 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable4",
                                  GraphType = "calibration",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Partial Dependence Calibration Plot for factor variables:
p4 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable11",
                                  GraphType = "calibration",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Plot all the individual graphs in a single pane
RemixAutoML::multiplot(plotlist = list(p1, p2, p3, p4), cols = 2)
You'd be surprised at how many data scientists don't know how to turn their probabilities into class labels. Often they will just go with 50% as the cutoff without a second thought, regardless of any class imbalance or asymmetric costs in the outcomes. I've even spoken to data scientists in the healthcare industry where predicting events such as "has disease" versus "does not have disease" came without utility thresholds. For some disease predictions it might not matter much, but for others, how can you possibly use an arbitrary threshold knowing that a false positive will have significantly different effects on a patient than a false negative? For situations like this you should be applying either the threshOptim() function or the RedYellowGreen() function from the RemixAutoML package in R.
Okay, so you spent time building out an awesome classification model. You are seeing a great AUC compared to previous versions. Now what? Your product manager asks you what threshold to use for classifying your predicted probabilities. How do you answer that?
You should know how to answer this question, and there are several methods you can use. H2O, for example, offers a number of threshold-selection criteria that are useful to know, including max F1 (its default), max F2, max F0.5, max accuracy, max precision, max recall, and max absolute MCC.
Okay, those sound technical, but which one do you use to optimize asymmetric costs and profits across correct predictions and Type 1 and Type 2 errors? Let's say the payoff matrix looks like the one below. H2O defaults to max F1, which will typically be sufficient for most cases, but it also offers F2 for penalizing a large number of false negatives and F0.5 for penalizing a large number of false positives. Those measures get you closer to where we want to be, but why not optimize the threshold precisely?
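To see how F1, F2, and F0.5 lead to different cutoffs, here is a small sketch of the underlying idea: sweep candidate thresholds and keep the one that maximizes F-beta. Larger beta weights recall (penalizing false negatives), so F2 tends to choose a lower threshold than F0.5. This is my own Python illustration with mock data, not H2O's implementation.

```python
import numpy as np

def fbeta(tp, fp, fn, beta):
    # F-beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, beta):
    """Sweep every observed score as a candidate threshold and
    return the one that maximizes F-beta."""
    best_t, best_s = 0.5, -1.0
    for t in np.unique(y_prob):
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        s = fbeta(tp, fp, fn, beta)
        if s > best_s:
            best_t, best_s = t, s
    return best_t

# Mock scores correlated with mock labels
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.15, size=2000), 0, 1)
t_f2 = best_threshold(y_true, y_prob, beta=2.0)   # recall-leaning
t_f1 = best_threshold(y_true, y_prob, beta=1.0)
t_f05 = best_threshold(y_true, y_prob, beta=0.5)  # precision-leaning
print(t_f2, t_f1, t_f05)
```

The F2-optimal threshold comes out at or below the F0.5-optimal one, which is exactly the asymmetry the text describes: favoring recall pushes the cutoff down, favoring precision pushes it up.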
If your payoff matrix looks something like the table below, such that correct predictions are not simply worth 1 and incorrect predictions worth 0 (the default values), then you should be using the threshOptim() and RedYellowGreen() functions in the RemixAutoML package for R.
Actual \ Predicted | Positive Prediction | Negative Prediction
Positive Outcome   |   0.0               | -15.0
Negative Outcome   |  -4.0               |   0.0
The threshOptim() function uses the costs in the payoff matrix to find the single threshold that maximizes utility. For cases where uncertain probability predictions should warrant further analysis by a medical professional, use the RedYellowGreen() function instead. It is designed to let you plug in not only the costs of a false positive and a false negative but also the cost of further analysis, and it returns two thresholds. Any predicted probability that falls between the two thresholds should be sent for review, while probabilities below the lower threshold are clear negative outcomes and those above the upper threshold are clear positive outcomes.
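To make the utility maximization concrete, here is a sketch of the kind of computation threshOptim() performs, using the payoff matrix above: score every candidate threshold by total utility over the validation set and keep the argmax. This is my own Python illustration, not RemixAutoML's code, and the mock data is invented for the example.

```python
import numpy as np

# Payoff matrix from the article: correct predictions are worth 0,
# a false negative costs -15, a false positive costs -4.
UTILITY = {"tp": 0.0, "fp": -4.0, "fn": -15.0, "tn": 0.0}

def optimal_threshold(y_true, y_prob, utility):
    """Sweep thresholds and return the one maximizing total utility."""
    best_t, best_u = 0.5, -np.inf
    for t in np.linspace(0.01, 0.99, 99):
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        u = (tp * utility["tp"] + fp * utility["fp"]
             + fn * utility["fn"] + tn * utility["tn"])
        if u > best_u:
            best_t, best_u = t, u
    return best_t, best_u

# Mock scores correlated with mock labels
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=5000)
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.15, size=5000), 0, 1)
t, u = optimal_threshold(y_true, y_prob, UTILITY)
print(f"optimal threshold = {t:.2f}, total utility = {u:.0f}")
```

Because a false negative costs almost four times as much as a false positive here, the utility-maximizing threshold lands well below the naive 50% cutoff: the model should err on the side of predicting the positive class.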
Below is a sample plot output in R from RemixAutoML::RedYellowGreen(), generated automatically when you run it. The lower threshold is 0.32 and the upper threshold is 0.34, so if you generate a predicted probability of 0.33, you would send that instance for further review.
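The decision rule implied by those two thresholds takes only a few lines. The 0.32 / 0.34 values come from the example plot above; the function name and labels are my own shorthand, not part of RemixAutoML.

```python
def triage(prob, lower=0.32, upper=0.34):
    """Apply the two-threshold rule: below the lower bound is a clear
    negative, above the upper bound a clear positive, and anything in
    between is routed for expert review."""
    if prob < lower:
        return "negative"
    if prob <= upper:
        return "review"
    return "positive"

print(triage(0.10))  # negative
print(triage(0.33))  # review
print(triage(0.90))  # positive
```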
Utility calculation for threshOptim()
Utility calculation for RedYellowGreen()