Accurate demand forecasts are essential for any retailer competing with Amazon: poor forecasts are a fast way to lose business to them. Demand forecasting is also one of the “low-hanging fruits” for a new data science department at a company that is just getting started on machine learning and AI initiatives. With accurate demand forecasts, you can boost profits by optimizing your labor, prices, and inventory.
Typically, when companies are creating forecasts, they’re creating them on a time series basis. That is, they are generating daily, weekly, monthly, quarterly, or yearly forecasts. Some real-world business applications of time series forecasting are:
Retail or B2B:
eCommerce:
Lack of Automation. Many current forecasting processes require one or more people to update an ugly, complicated Excel spreadsheet with multiple tabs and formulas, and that process is error-prone.
Scalability. Forecasting processes at companies often rely on an individual’s own non-statistical methodology, and that person usually leaves no documentation for how to update it, reverse engineer it, or integrate it with current business processes.
Computation and Turnaround Time. Producing thousands or hundreds of thousands of forecasts takes a long time, especially when the process is manual. At past companies, we’ve seen this process take several hours and sometimes days, while managers, VPs, and business stakeholders expect results on much tighter deadlines.
Lack of Resources and Personnel. Several people could be involved in creating forecasts for thousands of stores or SKUs, and it becomes an even bigger challenge if those people need to be quantitative experts.
Bias and Lack of Accuracy. Oftentimes there’s too much manual intervention putting “guard rails” on the forecasts, with no documentation on why they were put in place. These manual overrides tend to increase forecast error, which is the difference between the actual and the predicted value.
AutoCatBoostCARMA() from the RemixAutoML package fixes all of these problems in a single function (basically, just one line of R code) and takes minutes via GPU or only a few hours via CPU.
AutoCatBoostCARMA is a multivariate forecasting function from the RemixAutoML package in R that leverages the CatBoost gradient boosting algorithm. CARMA stands for Calendar, Autoregressive, Moving Average + time trend. AutoCatBoostCARMA really shines for multivariate time series forecasting. Most time series modeling functions can only build a model on a single series at a time; AutoCatBoostCARMA can build any number of time series all at once. You can run it for a single time series, but I have found that AutoTS() from RemixAutoML will almost always generate more accurate results.
The function replicates an ARMA process (autoregressive moving average) in that it will build the model utilizing lags and moving averages off of the target variable. It will then make a one-step ahead forecast, use the forecast value to regenerate the lags and moving averages, forecast the next step, and repeat, for all forecasting steps, just like an ARMA model does. However, there are several other features that the model utilizes.
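The recursive scheme can be sketched in a few lines of base R. This is an illustration of the general idea only, not AutoCatBoostCARMA’s implementation; the toy series, the lm() model, and the lag/moving-average feature set are all made up for the example.

```r
# Sketch of recursive one-step-ahead forecasting with regenerated features
set.seed(42)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 100)) + 50  # toy series

# Build training features: lag-1, lag-2, and a trailing 4-period moving average
train <- data.frame(
  target = y[5:100],
  lag1   = y[4:99],
  lag2   = y[3:98],
  ma4    = sapply(5:100, function(i) mean(y[(i - 4):(i - 1)]))
)
fit <- lm(target ~ lag1 + lag2 + ma4, data = train)

# Rebuild the feature row from the end of a (possibly extended) series
make_features <- function(series) {
  n <- length(series)
  data.frame(lag1 = series[n],
             lag2 = series[n - 1],
             ma4  = mean(series[(n - 3):n]))
}

# Recursive forecast: append each prediction, then regenerate lags/MAs from it
h <- 12
series <- y
for (step in seq_len(h)) {
  pred <- predict(fit, newdata = make_features(series))
  series <- c(series, pred)
}
forecast <- tail(series, h)
```

The key point is that after the first step, the lags and moving averages are computed from forecasted values rather than observed ones, exactly as in an ARMA process.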
The full set of model features includes:
A recent study evaluated the performance of many classical and modern machine learning and deep learning methods on a large set of more than 1,000 univariate time series forecasting problems.
The results of this study suggest that simple classical methods, such as ARIMA and exponential smoothing, outperform complex and sophisticated methods, such as decision trees, Multilayer Perceptrons (MLP), and Long Short-Term Memory (LSTM) network models.
We decided to do a similar experiment by comparing AutoTS (also from the RemixAutoML package) versus AutoCatBoostCARMA on the Walmart store sales data set from Kaggle. Here are the overall highlights of that experiment:
NOTE: WE DON’T RECOMMEND RUNNING THIS R CODE UNLESS YOU’VE CONFIGURED A GPU; OTHERWISE IT WILL TAKE 3-5 HOURS. WITH A GPU, YOU SHOULD BE ABLE TO COMPLETE ~2,660 FORECASTS IN ABOUT 15 MINUTES ON AN NVIDIA GeForce 1080 Ti.
library(RemixAutoML)
library(data.table)

########################################
# Prepare data for AutoTS()----
########################################

# Load Walmart Data from Remix Institute's Box Account----
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv")

# Subset for Stores / Departments with Full Series Available (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][
  Counts == 143][, Counts := NULL]

# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]

# Group Concatenation----
data[, GroupVar := do.call(paste, c(.SD, sep = " ")), .SDcols = c("Store","Dept")]
data[, c("Store","Dept") := NULL]

# Grab Unique List of GroupVar----
StoreDept <- unique(data[["GroupVar"]])

# AutoTS() Builds: Keep Run Times and AutoTS() Results----
# NOTES:
# 1. SkipModels: run everything
# 2. StepWise: runs way faster this way (cartesian check otherwise, but parallelized)
# 3. TSClean: smooth outliers and do time series imputation over the data first
# 4. ModelFreq: algorithmically identify a series frequency to build your ts data object
TimerList <- list()
OutputList <- list()
l <- 0
for(i in StoreDept) {
  l <- l + 1
  temp <- data[GroupVar == eval(i)]
  temp[, GroupVar := NULL]
  TimerList[[i]] <- system.time(
    OutputList[[i]] <- tryCatch({
      RemixAutoML::AutoTS(
        temp,
        TargetName = "Weekly_Sales",
        DateName = "Date",
        FCPeriods = 52,
        HoldOutPeriods = 30,
        EvaluationMetric = "MAPE",
        TimeUnit = "week",
        Lags = 25,
        SLags = 1,
        NumCores = 4,
        SkipModels = NULL,
        StepWise = TRUE,
        TSClean = TRUE,
        ModelFreq = TRUE,
        PrintUpdates = FALSE)},
      error = function(x) "Error in AutoTS run"))
  print(l)
}

# Save Results When Done and Pull Them in After AutoCatBoostCARMA() Run----
save(TimerList, file = paste0(getwd(), "/TimerList.R"))
save(OutputList, file = paste0(getwd(), "/OutputList.R"))

########################################
# Prepare data for AutoCatBoostCARMA()----
########################################

# Load Walmart Data from Remix Institute's Box Account----
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv")

# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][
  Counts == 143][, Counts := NULL]

# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]

# Run AutoCatBoostCARMA()----
# NOTES:
# 1. GroupVariables get concatenated into a single column but returned back to normal
# 2. Lags and MA_Periods cover both regular and seasonal so mix it up!
# 3. CalendarVariables: seconds, hour, wday, mday, yday, week, isoweek, month, quarter, year
# 4. TimeTrendVariable: 1:nrow(x) by group with 1 being the furthest back in time;
#    no need for quadratic or beyond since catboost will fit nonlinear relationships
# 5. DataTruncate: TRUE to remove records with imputed values for NA's created by the
#    DT_GDL_Feature_Engineering
# 6. SplitRatios: written the way it is to ensure same ratio split as AutoTS()
# 7. TaskType: I use GPU but if you don't have one, set to CPU
# 8. I did not set GridTune to TRUE because I didn't want to wait
# 9. GridEvalMetric and ModelCount only matter if GridTune is TRUE
# 10. NTrees: Yes, I used 15k trees and I could have used more since the best model
#     performance utilized all trees (hit upper boundary)
# 11. PartitionType: "timeseries" allows time-based splits by groups IF you have equal sized
#     groups within each series ("random" is, well, random; "time" is for transactional data)
# 12. Timer: Set to TRUE to get a print out of which forecasting step you are on when the
#     function hits that stage
# *13. TargetTransformation is a new feature. Automatically choose the best transformation for
#      the target variable. Tries YeoJohnson, BoxCox, arcsinh, along with
#      asin(sqrt(x)) and logit for proportion data
Results <- RemixAutoML::AutoCatBoostCARMA(
  data,
  TargetColumnName = "Weekly_Sales",
  DateColumnName = "Date",
  GroupVariables = c("Store","Dept"),
  FC_Periods = 52,
  TimeUnit = "week",
  TargetTransformation = TRUE,
  Lags = c(1:25, 51, 52, 53),
  MA_Periods = c(1:25, 51, 52, 53),
  CalendarVariables = TRUE,
  TimeTrendVariable = TRUE,
  DataTruncate = FALSE,
  SplitRatios = c(1 - 2*30/143, 30/143, 30/143),
  TaskType = "GPU",
  EvalMetric = "MAE",
  GridTune = FALSE,
  GridEvalMetric = "mae",
  ModelCount = 1,
  NTrees = 20000,
  PartitionType = "timeseries",
  Timer = TRUE)

# Plot aggregate sales forecast (Stores and Departments rolled up into Total)----
Results$TimeSeriesPlot

# Metrics for every store / dept. combo----
# NOTES:
# 1. Can also pull back other AutoCatBoostRegression() info such as
#    Variable Importance, Evaluation Plots / BoxPlots, Partial
#    Dependence Plots / BoxPlots, etc.
ML_Results <- Results$ModelInformation$EvaluationMetricsByGroup

# Transformation info:
#   ColumnName = Variable Modified
#   MethodName = Transformation Method
#   Lambda = lambda value for YeoJohnson or BoxCox; NA otherwise
#   NormalizedStatistic = pearson statistic
# Note: value of 0.0000 is a filler value for prediction values
#   and it's included to show that the correct transformation was done
TransformInfo <- Results$TransformationDetail
#      ColumnName MethodName    Lambda NormalizedStatistics
# 1: Weekly_Sales YeoJohnson 0.6341344             532.3125
# 2:  Predictions YeoJohnson 0.6341344               0.0000

##################################################
# AutoTS() and AutoCatBoostCARMA() Comparison----
##################################################

# Load AutoTS outputs we saved earlier----
load(paste0(getwd(), "/TimerList.R"))
load(paste0(getwd(), "/OutputList.R"))

# Group Concatenation----
data[, GroupVar := do.call(paste, c(.SD, sep = " ")), .SDcols = c("Store","Dept")]
data[, c("Store","Dept") := NULL]

# Grab unique list of GroupVar----
StoreDept <- unique(data[["GroupVar"]])

# AutoTS: format results----
results <- list()
for(i in 1:2660) {
  results[[i]] <- tryCatch({
    OutputList[[i]]$EvaluationMetrics[1,]},
    error = function(x) return(data.table::data.table(
      ModelName = "NONE",
      MeanResid = NA,
      MeanPercError = NA,
      MAPE = NA,
      MAE = NA,
      MSE = NA,
      ID = 0)))
}

# AutoTS() Results----
Results <- data.table::rbindlist(results)

# AutoTS() Model Winners by Count----
print(
  data.table::setnames(
    Results[, .N, by = "ModelName"][order(-N)],
    "N", "Counts of Winners"))
#               ModelName Counts of Winners
#  1:               TBATS               556
#  2:            TSLM_TSC               470
#  3:           TBATS_TSC               469
#  4:               ARIMA               187
#  5:           ARIMA_TSC               123
#  6:     TBATS_ModelFreq               117
#  7:              ARFIMA                86
#  8:                  NN                74
#  9:                 ETS                69
# 10:     ARIMA_ModelFreq                68
# 11:              NN_TSC                66
# 12:     NN_ModelFreqTSC                63
# 13:        NN_ModelFreq                60
# 14:          ARFIMA_TSC                52
# 15:       ETS_ModelFreq                51
# 16:  TBATS_ModelFreqTSC                38
# 17:   TSLM_ModelFreqTSC                29
# 18: ARFIMA_ModelFreqTSC                27
# 19:    ETS_ModelFreqTSC                23
# 20:  ARIMA_ModelFreqTSC                15
# 21:    ARFIMA_ModelFreq                11
# 22:                NONE                 6

# AutoTS() Run Times----
User <- data.table::data.table(data.table::transpose(TimerList)[[1]])
data.table::setnames(User, "V1", "User")
SystemT <- data.table::data.table(data.table::transpose(TimerList)[[2]])
data.table::setnames(SystemT, "V1", "System")
Elapsed <- data.table::data.table(data.table::transpose(TimerList)[[3]])
data.table::setnames(Elapsed, "V1", "Elapsed")
Times <- cbind(User, SystemT, Elapsed)

# AutoTS Run time Results----
MeanTimes <- Times[, .(User = sum(User),
                       System = sum(System),
                       Elapsed = sum(Elapsed))]

# AutoTS() Run Time In Hours----
print(MeanTimes/60/60)
#        User    System  Elapsed
# 1: 29.43282 0.3135111 33.24209

# AutoTS() Results Preparation----
Results <- cbind(StoreDept, Results)
GroupVariables <- c("Store","Dept")
Results[, eval(GroupVariables) := data.table::tstrsplit(StoreDept, " ")][
  , ':=' (StoreDept = NULL, ID = NULL)]
data.table::setcolorder(Results, c(7, 8, 1:6))

# Merge in AutoCatBoostCARMA() and AutoTS() Results----
FinalResults <- merge(ML_Results, Results,
                      by = c("Store","Dept"),
                      all = FALSE)

# Add Indicator Column for AutoCatBoostCARMA() Wins----
FinalResults[, AutoCatBoostCARMA := ifelse(MAPE_Metric < MAPE, 1, 0)]

# Percentage of AutoCatBoostCARMA() Wins----
print(paste0("AutoCatBoostCARMA() performed better on MAPE values ",
             round(100 * FinalResults[!is.na(MAPE), mean(AutoCatBoostCARMA)], 1),
             "% of the time vs. AutoTS()"))
# [1] "AutoCatBoostCARMA() performed better on MAPE values 41% of the time vs. AutoTS()"

# AutoCatBoostCARMA() Average MAPE by Store and Dept----
print(paste0("AutoCatBoostCARMA() Average MAPE of ",
             round(100 * FinalResults[!is.na(MAPE), mean(MAPE_Metric)], 1), "%"))
# [1] "AutoCatBoostCARMA() Average MAPE of 14.1%"

# AutoTS() Average MAPE by Store and Dept----
print(paste0("AutoTS() Average MAPE of ",
             round(100 * FinalResults[!is.na(MAPE), mean(MAPE)], 1), "%"))
# [1] "AutoTS() Average MAPE of 12%"

#################################################
# AutoTS() by top 100 Grossing Departments----
#################################################
temp <- data[, .(Weekly_Sales = sum(Weekly_Sales)), by = "GroupVar"][
  order(-Weekly_Sales)][1:100][, "GroupVar"]
GroupVariables <- c("Store","Dept")
temp[, eval(GroupVariables) := data.table::tstrsplit(GroupVar, " ")][
  , ':=' (GroupVar = NULL, ID = NULL)]
temp1 <- merge(FinalResults, temp, by = c("Store","Dept"), all = FALSE)

# Percentage of AutoCatBoostCARMA() Wins----
print(paste0("AutoCatBoostCARMA() performed better on MAPE values ",
             round(100 * temp1[!is.na(MAPE), mean(AutoCatBoostCARMA)], 1),
             "% of the time vs. AutoTS()"))
# [1] "AutoCatBoostCARMA() performed better than AutoTS() on MAPE values 47% of the time"

# AutoCatBoostCARMA() Average MAPE by Store and Dept----
print(paste0("AutoCatBoostCARMA() Average MAPE of ",
             round(100 * temp1[!is.na(MAPE), mean(MAPE_Metric)], 1), "%"))
# [1] "AutoCatBoostCARMA() Average MAPE of 5.6%"

# AutoTS() Average MAPE by Store and Dept----
print(paste0("AutoTS() Average MAPE of ",
             round(100 * temp1[!is.na(MAPE), mean(MAPE)], 1), "%"))
# [1] "AutoTS() Average MAPE of 5.6%"
Automated forecasting is the process of automating the data wrangling and preparation of your time series data, splitting the data into training and holdout sets, training several different time series models, testing each model on the holdout set to measure its accuracy, and then choosing the most accurate model and re-fitting it on the entire data set to create a forecast over a specified time horizon. That would typically take several steps and hundreds of lines of code, but AutoTS does this type of automated forecasting in a single line of code.
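As a rough illustration of that loop (not AutoTS’s actual internals), here is a base-R sketch using two candidate models and the built-in AirPassengers series; the model choices, orders, and MAPE scorer are assumptions for the example.

```r
# Sketch of the automated-forecasting loop: split, fit candidates,
# score on holdout, pick a champion
y <- ts(AirPassengers, frequency = 12)
holdout_n <- 12
train <- window(y, end = time(y)[length(y) - holdout_n])  # all but last year
test  <- as.numeric(tail(y, holdout_n))                   # last year held out

# Fit candidate models on training data only
models <- list(
  hw    = HoltWinters(train),
  arima = arima(train, order = c(1, 1, 1),
                seasonal = list(order = c(0, 1, 1), period = 12))
)

# Score each candidate on the holdout by MAPE
mape <- function(a, p) mean(abs((a - p) / a)) * 100
scores <- sapply(names(models), function(m) {
  p <- if (m == "hw") as.numeric(predict(models[[m]], holdout_n))
       else           as.numeric(predict(models[[m]], n.ahead = holdout_n)$pred)
  mape(test, p)
})

# The champion would then be re-fit on the full series and used to forecast
champion <- names(which.min(scores))
```

AutoTS does the same thing conceptually, except across its full model list with tuned lags, seasonal lags, outlier cleaning, and frequency detection.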
Some examples of forecasting that we’ve seen at Fortune 500 companies and tech startups by industry are:
Some of the challenges of enterprise forecasting are doing so in an automated, scalable, and unbiased way. Too often, business unit stakeholders create complicated Excel spreadsheets, with lots of tabs, formulas, and ugly formatting, using their own individual methodology and leaving no process for how to update or reverse engineer it. When the employee(s) who manage those Excel spreadsheets leave the company, the enterprise use of the forecast stops, and the process has to be rebuilt from scratch.
So this current process is neither automated (it requires specific personnel to manually update it), scalable (Excel doesn’t scale, and the forecasts stop as soon as the employee leaves), nor unbiased (the employee applied their own individual methodology without giving insight into it). Additionally, enterprise forecasts are often generated by personnel without a quantitative, statistical, or coding background, which leads to forecast errors.
AutoTS stands for automated time series, and it automatically finds and creates the most accurate forecast from a list of 7 econometric time series models including ARIMA, Holt-Winters, and Autoregressive Neural Networks.
It’s a function inside the RemixAutoML package in the open-source programming language R. R is a popular programming language for data scientists and analysts that is used to build statistical and machine learning models along with data visualizations.
The beauty of AutoTS and RemixAutoML is their simplicity and ease of use. Even if you’ve never programmed in R, you can still use AutoTS easily. If you’ve ever used a function inside Excel like sum() or if() formulas, then you can code using AutoTS.
The logo of AutoTS is a robot sniper, which symbolizes automation and accuracy.
AutoTS solves the automation problem because it eliminates manual updates of Excel forecast templates and removes reliance on an individual employee’s undocumented methodology, likely created by someone with a non-quantitative background. AutoTS instead uses best-in-class statistical and machine learning models, greatly reducing the risk of inaccurate forecasts.
AutoTS solves the scalability problem since it’s open source and code-based, and therefore, by its nature, reproducible. It can also be integrated into several popular BI platforms that have R integration, such as Tableau and PowerBI, as well as drag-and-drop analytics platforms like Alteryx.
AutoTS solves the bias problem since it doesn’t rely on human judgement, intuition, or manual intervention. That’s typically what creates error and bad decision-making in the first place. AutoTS is machine learning and statistically based.
AutoTS produces accurate forecasts by running your data through 7 different econometric time series models and choosing the most accurate one that predicts best out-of-sample. Out-of-sample is defined as the holdout data set. Accuracy is defined as lowest mean absolute percentage error (MAPE).
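In principle, the selection step works like the following hand-rolled sketch (made-up holdout numbers and model names, not AutoTS’s code):

```r
# MAPE: mean absolute percentage error, expressed as a percent
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100

# Hypothetical holdout actuals and two models' out-of-sample predictions
actual  <- c(100, 120, 90, 110)
model_a <- c(98, 125, 85, 108)
model_b <- c(110, 100, 95, 130)

# Score each model on the holdout and keep the one with the lowest MAPE
scores <- c(A = mape(actual, model_a), B = mape(actual, model_b))
champion <- names(which.min(scores))  # "A" here: it tracks the holdout closer
```

Because MAPE is scale-free, it lets AutoTS compare the same candidate models across series with very different sales volumes.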
The data set we’re using is weekly sales by Walmart store from Kaggle. The R code does some basic data wrangling to get total sales by week for the highest-grossing store, as the raw data set is by week, store, and department. If you have an internal company data set with a metric you want to forecast grouped by day, you can substitute it at Line 34, where “top_store_weekly_sales” is defined, and then change the TimeUnit argument in AutoTS to “day”.
You can see how few lines of code are needed to create accurate, automated, scalable, and unbiased forecasts using machine learning. No more messy spreadsheets. Technically, AutoTS only uses 1 line of R code, but we gave each function argument its own line for tutorial presentation purposes.
We drew some inspiration for branding the forecast plot output with RemixAutoML based on Michael Toth’s blog here.
library(RemixAutoML)
library(data.table)
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(magick)
library(grid)

# IMPORT DATA FROM REMIX INSTITUTE BOX ACCOUNT ----------
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
walmart_store_sales_data = data.table::fread(
  "https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv",
  header = T, stringsAsFactors = FALSE)

# FIND TOP GROSSING STORE (USING dplyr) ---------------------
# group by Store, sum Weekly Sales
top_grossing_store = walmart_store_sales_data %>%
  dplyr::group_by(., Store) %>%
  dplyr::summarize(., Weekly_Sales = sum(Weekly_Sales, na.rm = TRUE))

# max Sales of 45 stores
max_sales = max(top_grossing_store$Weekly_Sales)

# find top grossing store
top_grossing_store = top_grossing_store %>%
  dplyr::filter(., Weekly_Sales == max_sales)
top_grossing_store = top_grossing_store$Store %>% as.numeric(.)

# what is the top grossing store?
print(paste("Store Number: ", top_grossing_store, sep = ""))

# FIND WEEKLY SALES DATA FOR TOP GROSSING STORE (USING data.table) ----------
top_store_weekly_sales <- walmart_store_sales_data[
  Store == eval(top_grossing_store),
  .(Weekly_Sales = sum(Weekly_Sales, na.rm = TRUE)),
  by = "Date"]

# FORECAST WEEKLY SALES FOR WALMART STORE USING AutoTS ------
# forecast for the next 16 weeks - technically 1 line of code, but
# each argument was dedicated its own line for presentation purposes
weekly_forecast = RemixAutoML::AutoTS(
  data = top_store_weekly_sales,
  TargetName = "Weekly_Sales",
  DateName = "Date",
  FCPeriods = 16,
  HoldOutPeriods = 12,
  TimeUnit = "week"
)

# VISUALIZE AutoTS FORECASTS ----------------
# view 16 week forecast
View(weekly_forecast$Forecast)

# View model evaluation metrics
View(weekly_forecast$EvaluationMetrics)

# which model won?
print(weekly_forecast$ChampionModel)

# see ggplot of forecasts
plot = weekly_forecast$TimeSeriesPlot

# change y-axis to currency
plot = plot + ggplot2::scale_y_continuous(labels = scales::dollar)

# RemixAutoML branding. Inspiration here:
# https://michaeltoth.me/you-need-to-start-branding-your-graphs-heres-how-with-ggplot.html
logo = magick::image_read("https://www.remixinstitute.com/wp-content/uploads/7b-Cheetah_Charcoal_Inline_No_Sub_No_BG.png")
plot
grid::grid.raster(logo, x = .73, y = 0.01,
                  just = c('left', 'bottom'), width = 0.25)
ICE and Partial Dependence Plots in LIME fail to tell me the accuracy surrounding the fitted relationship. Further, ICE doesn’t tell me the probability of each of the lines occurring. Your model can overfit or underfit your data pretty easily, especially if you are using deep learning models, and LIME fails to tell me how the model actually performs.
Imagine you are working on a price elasticity model that will guide pricing decisions. Currently you would show the relationship that the model was able to fit. Given that we will be using a model to guide pricing decisions, a sensible stakeholder might ask, “I see the relationship that your model fit, but how do I know that corresponds to the actual relationship?”
What do you do? Give the stakeholder some model accuracy metrics? Tell them that you used deep learning so they should just trust it because it is state-of-the-art technology?
Here is a simple solution to the shortfall of partial dependence plots: use calibration on your predicted relationship. It’s that simple. Below is an example plot from the RemixAutoML package in R. The x-axis is the independent variable of interest, and the spacing between ticks is based on percentiles of that variable’s distribution. That means the data is uniformly distributed across the x-axis, so there is no need for the dashes shown above in the ICE chart. Secondly, we can see the relationship of the independent variable to the target variable, as partial dependence plots show, but we can also see how well the model fits across the range of the independent variable. This addresses stakeholders’ skepticism about the accuracy of your predictions. If you want to see the variability of your predictions, use the boxplot version too. If you want to see the relationship for a specific group, simply subset your data to that group of interest and rerun the function.
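The underlying computation can be approximated in a few lines of base R. This is a sketch of the idea, not ParDepCalPlots() internals; the simulated data and the 5% percentile buckets are assumptions for the example.

```r
# Sketch of a partial-dependence calibration table: bucket by percentiles of
# the independent variable, then compare mean actual vs. mean predicted per bucket
set.seed(3)
x      <- runif(2000)
actual <- x^2 + rnorm(2000, 0, 0.05)  # true relationship plus noise
pred   <- x^2 + 0.02                  # a slightly biased mock model

# 20 buckets, each holding 5% of the data (like PercentileBucket = 0.05)
buckets <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.05)),
               include.lowest = TRUE)
df <- data.frame(actual = actual, pred = pred, bucket = buckets)

# One row per x-axis tick: if the two columns track each other closely,
# the fitted relationship is well calibrated in that region
calib <- aggregate(cbind(actual, pred) ~ bucket, data = df, FUN = mean)
```

Plotting the two aggregated columns against the bucket midpoints gives the calibration curve; any region where they diverge is where the stakeholders’ skepticism is warranted.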
#######################################################
# Create data to simulate validation data with predicted values
#######################################################

# Correl: This is the correlation used to determine how correlated the variables are to
# the target variable. Switch it up (between 0 and 1) to see how the charts below change.
Correl <- 0.85
data <- data.table::data.table(Target = runif(1000))

# Mock independent variables - they are correlated variables with
# various transformations so you can see different kinds of relationships
# in the charts below

# Helper columns for creating simulated variables
data[, x1 := qnorm(Target)]
data[, x2 := runif(1000)]

# Create one variable at a time
data[, Independent_Variable1 := log(pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable2 := (pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable3 := exp(pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable4 := exp(exp(pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2))))]
data[, Independent_Variable5 := sqrt(pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable6 := (pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))^0.10]
data[, Independent_Variable7 := (pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))^0.25]
data[, Independent_Variable8 := (pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))^0.75]
data[, Independent_Variable9 := (pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))^2]
data[, Independent_Variable10 := (pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))^4]
data[, Independent_Variable11 := ifelse(Independent_Variable2 < 0.20, "A",
                                 ifelse(Independent_Variable2 < 0.40, "B",
                                 ifelse(Independent_Variable2 < 0.6, "C",
                                 ifelse(Independent_Variable2 < 0.8, "D", "E"))))]

# We'll use this as a mock predicted value
data[, Predict := (pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2)))]

# Remove the helper columns
data[, ':=' (x1 = NULL, x2 = NULL)]

# In the ParDepCalPlots() function below, note the Function argument -
# we are using mean() to aggregate our values but you
# can use quantile(x, probs = y) for quantile regression

# Partial Dependence Calibration Plot:
p1 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable1",
                                  GraphType = "calibration",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Partial Dependence Calibration BoxPlot: note the GraphType argument
p2 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable1",
                                  GraphType = "boxplot",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Partial Dependence Calibration Plot:
p3 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable4",
                                  GraphType = "calibration",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Partial Dependence Calibration BoxPlot for factor variables:
p4 <- RemixAutoML::ParDepCalPlots(data,
                                  PredictionColName = "Predict",
                                  TargetColName = "Target",
                                  IndepVar = "Independent_Variable11",
                                  GraphType = "calibration",
                                  PercentileBucket = 0.05,
                                  FactLevels = 10,
                                  Function = function(x) mean(x, na.rm = TRUE))

# Plot all the individual graphs in a single pane
RemixAutoML::multiplot(plotlist = list(p1, p2, p3, p4), cols = 2)
You’d be surprised at how many data scientists don’t know how to turn their probabilities into class labels. Oftentimes they will just go with 50% as the cutoff without a second thought, regardless of any class imbalances or asymmetric costs in the outcomes. I’ve even spoken to data scientists in the healthcare industry where predicting events such as “has disease” versus “does not have disease” came without utility thresholds. For some disease predictions it might not matter much, but for others, how can you possibly use an arbitrary threshold knowing that the cost of a false positive will have significantly different effects on a patient compared to a false negative? You should be applying either the threshOptim() function or the RedYellowGreen() function from the RemixAutoML package in R for situations like this.
Okay, so you spent time building out an awesome classification model. You are seeing a great AUC compared to previous versions. Now what? Your product manager asks you what threshold to use for classifying your predicted probabilities. How do you answer that?
You should know how to answer this question. There are several methods you can use. H2O, for example, offers several which may be useful for you to know. Those are:
Okay, those sound technical, but which one do you use to optimize asymmetrical costs and payoffs for correct predictions and Type 1 and Type 2 errors? Let’s say the payoff matrix looks like the one below. H2O defaults to max f1, which will typically be sufficient for most cases, but it also offers F2 for penalizing a large number of false negatives and f0point5 for penalizing a large number of false positives. Those measures get you closer to where we want to be, but why not optimize the threshold precisely?
If your confusion matrix looks something like the table below, such that it’s not comprised of 1’s for correct predictions and 0’s for incorrect predictions (the default values), then you should be using the threshOptim() and RedYellowGreen() functions in the RemixAutoML package for R.
Actual \ Predicted | Positive Prediction | Negative Prediction
------------------ | ------------------- | -------------------
Positive Outcome   | 0.0                 | -15.0
Negative Outcome   | -4.0                | 0.0
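Given such a payoff matrix, the single-threshold search can be sketched as follows. This illustrates the idea only, not threshOptim()’s actual code, and the simulated labels and scores are made up for the example.

```r
# Utility-maximizing threshold under the asymmetric payoff matrix above:
# true positive = 0, false negative = -15, false positive = -4, true negative = 0
utility_tp <- 0; utility_fn <- -15; utility_fp <- -4; utility_tn <- 0

# Mock labels and predicted probabilities
set.seed(1)
actual <- rbinom(1000, 1, 0.3)
prob   <- plogis(qlogis(0.3) + actual * 1.5 + rnorm(1000, 0, 0.5))

# Evaluate total utility at each candidate threshold
thresholds <- seq(0.01, 0.99, by = 0.01)
utility <- sapply(thresholds, function(t) {
  pred <- as.integer(prob >= t)
  sum(utility_tp * (pred == 1 & actual == 1) +
      utility_fn * (pred == 0 & actual == 1) +
      utility_fp * (pred == 1 & actual == 0) +
      utility_tn * (pred == 0 & actual == 0))
})
best_threshold <- thresholds[which.max(utility)]
```

Because a false negative costs nearly four times a false positive here, the utility-maximizing cutoff lands well below the naive 50%: the model should err toward predicting the positive class.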
The threshOptim() function utilizes the costs in the confusion matrix to determine the single threshold that maximizes utility. For cases when uncertain probability predictions should warrant further analysis by a medical professional, you should use the RedYellowGreen() function. It is designed to let you plug in not only the costs of a false positive and a false negative but also the cost of further analysis, thus providing two thresholds. Any predicted probability that falls between the two thresholds should be sent for review, while predicted probabilities below the lower threshold can be treated as obvious negative outcomes and those above the upper threshold as obvious positive outcomes.
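The two-threshold idea can be sketched the same way. This is an assumption-laden illustration rather than RedYellowGreen()’s implementation; the review cost of -1 and the simulated data are invented for the example.

```r
# Two-threshold (red / yellow / green) search: probabilities between the
# thresholds incur a fixed review cost; the rest are classified with the
# false-negative / false-positive payoffs
cost_fn <- -15; cost_fp <- -4; cost_review <- -1  # assumed review cost

set.seed(2)
actual <- rbinom(500, 1, 0.3)
prob   <- plogis(qlogis(ifelse(actual == 1, 0.6, 0.25)) + rnorm(500, 0, 0.4))

grid <- seq(0.05, 0.95, by = 0.05)
best <- list(utility = -Inf)
for (lo in grid) for (hi in grid[grid >= lo]) {
  review <- prob > lo & prob < hi            # "yellow" zone: send for review
  pred   <- as.integer(prob >= hi)           # outside the zone: classify
  u <- sum(review) * cost_review +
       sum(!review & pred == 0 & actual == 1) * cost_fn +
       sum(!review & pred == 1 & actual == 0) * cost_fp
  if (u > best$utility) best <- list(utility = u, lower = lo, upper = hi)
}
```

When the review cost is cheap relative to a misclassification, the yellow zone widens; when review is expensive, the two thresholds collapse toward a single cutoff.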
Below is a sample plot output in R from RemixAutoML::RedYellowGreen() that is automatically generated when you run it. The lower threshold is 0.32 and the upper threshold is 0.34, so if you generate a predicted probability of 0.33, you would send that instance for further review.
Utility calculation for threshOptim()
Utility calculation for RedYellowGreen()
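The original post displays these utility calculations as images. A plausible reconstruction in our own notation, based on the payoff-matrix description above (an assumption about the exact form, not the package’s verbatim formulas):

```latex
% threshOptim(): expected utility at threshold t, maximized over t
U(t) = c_{tp}\,\mathrm{TP}(t) + c_{fp}\,\mathrm{FP}(t)
     + c_{fn}\,\mathrm{FN}(t) + c_{tn}\,\mathrm{TN}(t),
\qquad t^{*} = \arg\max_{t} U(t)

% RedYellowGreen(): utility over a lower and upper threshold, where
% predictions falling in (t_l, t_u) are sent for review at cost c_r
U(t_l, t_u) = c_r\,N_{\mathrm{review}}(t_l, t_u)
            + c_{fp}\,\mathrm{FP}(t_u) + c_{fn}\,\mathrm{FN}(t_l),
\qquad (t_l^{*}, t_u^{*}) = \arg\max_{t_l \le t_u} U(t_l, t_u)
```

Here the $c$ terms are the payoff-matrix entries (e.g., $c_{fn} = -15$ and $c_{fp} = -4$ in the table above), and the counts are taken over the validation data at the given threshold(s).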