The post How To Say No To Useless Data Science Projects And Start Working On What You Want first appeared on Remix Institute.
The post Quick Time Series Analysis of the CCI30 Crypto Index first appeared on Remix Institute.
So it’s time for a short review and forecast. To do this, I use R inside of RStudio, loading the following packages with this quick piece of code:
install.load::install_load(
"tidyquant"
,"timetk"
, "tibbletime"
, "sweep"
, "anomalize"
, "caret"
, "forecast"
, "funModeling"
, "xts"
, "fpp"
, "lubridate"
, "tidyverse"
, "urca"
, "prophet"
)
From the CCI30 site (which graciously makes its index data available), I grab the file, which gives us the Date and OHLCV (Open, High, Low, Close, Volume) columns. We can inspect the first row of the data:
head(df.tibble, 1)
# A time tibble: 1 x 6
# Index: Date
Date Open High Low Close Volume
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-12-30 2546. 2578. 2481. 2501. 45315440388.
A simple feature plot of the OHLCV gives the following:
From there I generate the daily return and log daily return of the closing price of the index. I then collapse the data by month and get the monthly log return.
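For reference, the daily log return step might look like the following (a sketch; the object name df.ts.tbl and the Close column are taken from the surrounding code):

```r
# Sketch of the daily log return step; df.ts.tbl is the time tibble of
# CCI30 OHLCV data loaded earlier, and Close is the index closing price
df.ts.daily <- df.ts.tbl %>%
  tq_transmute(
    select = Close
    , mutate_fun = periodReturn
    , period = "daily"
    , type = "log"
    , col_rename = "Daily.Log.Returns"
  )
```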
df.ts.monthly <- df.ts.tbl %>%
  tq_transmute(
    select = Close
    , mutate_fun = periodReturn
    , period = "monthly"
    , type = "log"
    , col_rename = "Monthly.Log.Returns"
  )
head(df.ts.monthly, 3)
# A time tibble: 3 x 2
# Index: Date
Date Monthly.Log.Returns
<date> <dbl>
1 2015-01-31 -0.396
2 2015-02-28 0.0807
3 2015-03-31 -0.138
Here is a decomposition of the daily log return of the index:
ACF (Auto Correlation Function) of Daily Log Returns:
After collapsing the data into a monthly time series format we again take a look at the decomposition:
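For readers following along, the monthly decomposition and the daily ACF can be generated along these lines (a sketch: the January 2015 start date matches the output shown above, but the object names for the daily returns are assumptions):

```r
# Monthly decomposition of the log returns (series assumed to start Jan 2015)
monthly_ts <- ts(df.ts.monthly$Monthly.Log.Returns, start = c(2015, 1), frequency = 12)
plot(stats::decompose(monthly_ts))

# ACF of the daily log returns, where daily_log_returns is a numeric vector
# of the daily log returns computed earlier (name is an assumption)
forecast::ggAcf(daily_log_returns)
```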
Now, let us look for anomalies in the monthly data. To do this, I use the anomalize package.
dfa_tsb <- df.ts.monthly %>%
time_decompose(Monthly.Log.Returns, method = "twitter") %>%
anomalize(remainder, method = "gesd") %>%
time_recompose()
dfa_tsb %>%
plot_anomaly_decomposition() +
xlab("Monthly Log Return") +
ylab("Value") +
labs(
title = "Anomaly Detection for CCI30 Monthly Log Returns"
, subtitle = "Method: GESD"
)
We can easily see the anomalous returns during what I refer to as the mainstream crypto craze of 2017.
With all of this done, we move on to forecasting the index. I forecast 12 months out using a few different models: HW (Holt-Winters), ETS (Error, Trend, Seasonality), Bagged ETS, ARIMA, SNaive, and Facebook Prophet. These models produce the following:
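A sketch of how those forecasts can be fit with the forecast and prophet packages follows; monthly_ts is an assumed monthly ts object of the series, and h is the 12-month horizon:

```r
h <- 12
fc_hw     <- forecast::forecast(stats::HoltWinters(monthly_ts), h = h)   # Holt-Winters
fc_ets    <- forecast::forecast(forecast::ets(monthly_ts), h = h)        # ETS
fc_bagged <- forecast::forecast(forecast::baggedETS(monthly_ts), h = h)  # Bagged ETS
fc_arima  <- forecast::forecast(forecast::auto.arima(monthly_ts), h = h) # ARIMA
fc_snaive <- forecast::snaive(monthly_ts, h = h)                         # SNaive

# Prophet expects a data frame with ds (date) and y (value) columns
m <- prophet::prophet(data.frame(ds = df.ts.monthly$Date, y = df.ts.monthly$Monthly.Log.Returns))
fc_prophet <- predict(m, prophet::make_future_dataframe(m, periods = h, freq = "month"))
```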
There is another package out called RemixAutoML that provides an AutoTS (automated time series) function, which we will also use to see if the results differ.
You can find my code on my GitHub. Feel free to contribute to the project.
Steven Paul Sanderson II, MPH is a Data Scientist at Long Island Community Hospital. He has several years of experience working in data science and analytics and holds a Master’s in Public Health from the Stony Brook University Health Sciences Center College of Medicine and a Bachelor’s in Economics from the State University of New York at Stony Brook. You can connect with him on LinkedIn at: https://www.linkedin.com/in/spsanderson/
The post Business AI for SMB’s: How To Compete With Amazon Using RemixAutoML first appeared on Remix Institute.
There are many reasons why this may be the case in any particular organization, though companies that share this struggle typically exhibit at least one of the following traits:
Sometimes data is rightly mistrusted because of the aforementioned problems in how it is collected and presented, but this issue can occur even in organizations with good data practices. It is a dangerous attitude for management to have: even if the data has been properly collected and structured, the various departments are collaborating, and the analysts and data scientists have done a good job of avoiding biases and other skews, reliable analysis can still be discarded due to a cultural mismatch. When this happens, management may pay lip service to analytics and data science but will continue marching onward based on personal preference.
So what exactly is the manager to do in today’s environment? Most sensible, informed business people have by now realized that data science and analytics are going to be critical components of business success from here on out, but the aforementioned problems often run deep and present enormous roadblocks. And what about startups or small businesses that don’t have these same problems, but also don’t have the budget to build a full-scale analytics or data science team?
Thankfully, the general trend of markets is deflationary: things get cheaper over time as they become more available, and this applies to analytics and data science tools just as much as to consumer goods. Consequently, open source tools which solve some of these issues, such as R, Python, and RemixAutoML, are beginning to appear, giving even entrepreneurs and small enterprises access to capabilities that just a few years ago would have been hard to come by even for the Fortune 500.
Let’s start with one example based on an Amazon model. If you’re looking for a company that leads in leveraging machine learning and AI to optimize sales and profits, look no further than Amazon. From AI-enabled products like Alexa, to AI-based product recommenders on Amazon’s website, to AWS machine learning software, Amazon has no shortage of uses for machine learning and AI. As Jeff Bezos said in his 2016 Letter to Amazon Shareholders,
These big trends are not that hard to spot (they get talked and written about a lot), but they can be strangely hard for large organizations to embrace. We’re in the middle of an obvious one right now: machine learning and artificial intelligence.
At Amazon, we’ve been engaged in the practical application of machine learning for many years now. Some of this work is highly visible: our autonomous Prime Air delivery drones; the Amazon Go convenience store that uses machine vision to eliminate checkout lines; and Alexa, our cloud-based AI assistant. (We still struggle to keep Echo in stock, despite our best efforts. A high-quality problem, but a problem. We’re working on it.)
But much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type – quietly but meaningfully improving core operations.
Superficial advice like “look at how successful Amazon is, so just do what they do” won’t help you; it’s almost impossible to imitate. Amazon’s R&D budget alone is $22 billion (according to Statista), dwarfing even the total annual revenues of most large companies. In all likelihood, your company’s R&D budget is $0. And since many companies don’t understand the significant investment required to be a leader in machine learning and AI (as Amazon is), odds are you need a more pragmatic approach that can be executed on a shoestring budget. Then, if you can pick up some quick-win ROI, management may see justification for further investment in machine learning and AI initiatives.
The example we’re going to show you is how to build Amazon’s ‘frequently bought together’ product up-sell algorithm. According to McKinsey, 35% of what consumers purchase on Amazon comes from its product recommendation algorithms. Amazon has been doing product recommenders for decades, and that means decades of research and investment you won’t be able to match. However, you can still build an Amazon-style ‘frequently bought together’ product recommender able to lift average order values and market basket sizes.
This is where you can use an open-source, automated machine learning tool (such as RemixAutoML) which allows small-to-medium sized businesses and startups to build Amazon-style ‘frequently bought together’ product recommendations with just a single line of code and very little data. Your organization may have a messy data warehouse, but the only data points needed are an invoice (or order) number and an item number for each transaction.
This data can easily be extracted from either your Point-Of-Sale system or your e-commerce platform. Using that data, an analytics professional can create a machine learning model with a single line of code in RemixAutoML capable of competing with even the largest big box retailers, thus lifting conversion rates, average order values, and repurchase rates while increasing market share.
Tools like RemixAutoML help overcome hurdles such as fractured data (since only few data points are required), mistrust of data and siloed objectives (as everyone can utilize and see immediate ROI), and biased analysis (as the tool uses laws of probability and machine learning to reduce bias).
As a quick example, consider this online retail dataset from an e-commerce company in the UK:
Again, the only two data points needed are invoice number (called InvoiceNo in this data set) and item number (called StockCode in the data set).
Running the following R code using RemixAutoML yields the end product below: a table of products frequently bought together, ranked by statistical significance. This equips the sales organization to leverage the table whenever it tries to upsell a customer on Product B, given that Product A has been added to their shopping cart. The upsell could be facilitated by adding ‘frequently bought together’ recommenders on the website, in a personalized sales email, or at the Point-Of-Sale in a brick-and-mortar store. For more technical users, details are provided in the R code comments.
The output mirrors Amazon’s ‘frequently bought together’ algorithm. The column called StockCode_LHS means “StockCode Left-Hand Side” and is the product the customer has added to their cart. The column called StockCode_RHS means “StockCode Right-Hand Side” and is the product that is most ‘frequently bought together’ with StockCode_LHS at the highest statistically significant level.
Support means the percent of total transactions in which StockCode_LHS and StockCode_RHS appear together. Confidence means the conditional probability that StockCode_RHS is purchased given that StockCode_LHS has been added to the cart. The columns called Lift, Chi_SQ, and P_Value are all statistical significance metrics of the relationship between StockCode_LHS and StockCode_RHS. RuleRank is the ranking system that RemixAutoML uses to rank the market basket affinities for you.
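To make those metrics concrete, here is a toy calculation (the counts are made up for illustration):

```r
# Toy example: 100 invoices total; product A appears on 20 of them,
# product B on 10, and A and B appear together on 5
n_invoices <- 100
n_A  <- 20
n_B  <- 10
n_AB <- 5

support    <- n_AB / n_invoices               # 0.05: A and B co-occur in 5% of baskets
confidence <- n_AB / n_A                      # 0.25: P(B in basket | A in basket)
lift       <- confidence / (n_B / n_invoices) # 2.5: B is 2.5x more likely when A is present
```

A lift above 1 means the two products co-occur more often than chance alone would suggest, which is what the Chi_SQ and P_Value columns then test for significance.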
Nick Gausling is a businessman and investor who has worked across multiple industries with companies both small and large. You can connect with Nick at https://www.linkedin.com/in/nick-gausling/ or www.nickgausling.com
Douglas Pestana is a data scientist with 10+ years of experience in data science and machine learning at Fortune 500 and Fortune 1000 companies. He is one of the co-founders of Remix Institute.
library(data.table)
library(dplyr)
library(magrittr)
library(RemixAutoML)
# IMPORT ONLINE RETAIL DATA SET THEN CLEAN DATA SET-------
# Original Source: UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets/online+retail
# download file from Remix Institute Box account
online_retail_data = data.table::fread("https://remixinstitute.box.com/shared/static/v2c7mkkqm9eswyqbzkg5tbqqzfa1885v.csv", header = T, stringsAsFactors = FALSE)
# create a flag for cancelled invoices
online_retail_data$CancelledInvoiceFlag = ifelse(substring(online_retail_data$InvoiceNo, 1, 1) == 'C', 1, 0)
# create a flag for negative quantities
online_retail_data$NegativeQuantityFlag = ifelse(online_retail_data$Quantity < 0, 1, 0)
# remove cancelled invoices and negative quantities
online_retail_data_clean = online_retail_data %>% dplyr::filter(., CancelledInvoiceFlag != 1) %>%
dplyr::filter(., NegativeQuantityFlag != 1)
# PREP DATA SET FOR MODELING -------------
# for market basket analysis models, you'll need data grouped by invoice (InvoiceNo) and item number (StockCode). Then you can sum up the units sold.
online_retail_data_for_model = online_retail_data_clean %>% dplyr::group_by(., InvoiceNo, StockCode) %>%
dplyr::summarise(., Quantity = sum(Quantity, na.rm = TRUE) )
# RUN AUTOMATED MARKET BASKET ANALYSIS (PRODUCT RECOMMENDER) IN RemixAutoML -----------
# the AutoMarketBasketModel from RemixAutoML automatically converts your data,
# runs the market basket model algorithm, and adds Chi-Square statistics for significance
market_basket_model = RemixAutoML::AutoMarketBasketModel(
data = online_retail_data_for_model,
OrderIDColumnName = "InvoiceNo",
ItemIDColumnName = "StockCode"
)
# add product Description
# left-hand side products
StockCode_LHS_description = online_retail_data_clean %>% dplyr::select(., StockCode, Description) %>%
dplyr::rename(., StockCode_LHS = StockCode,
Description_LHS = Description
) %>%
dplyr::distinct(., StockCode_LHS, .keep_all = TRUE)
# right-hand side products
StockCode_RHS_description = online_retail_data_clean %>% dplyr::select(., StockCode, Description) %>%
dplyr::rename(., StockCode_RHS = StockCode,
Description_RHS = Description
)%>%
dplyr::distinct(., StockCode_RHS, .keep_all = TRUE)
# merge
market_basket_model_final = merge(market_basket_model, StockCode_RHS_description, by = 'StockCode_RHS', all.x = TRUE)
market_basket_model_final = merge(market_basket_model_final, StockCode_LHS_description, by = 'StockCode_LHS', all.x = TRUE)
# re-sort by StockCode_LHS and RuleRank
market_basket_model_final = market_basket_model_final[order(StockCode_LHS, RuleRank),]
# view results
View(market_basket_model_final)
The post Why Machine Learning is more Practical than Econometrics in the Real World first appeared on Remix Institute.
I’ve read several studies and articles that claim Econometric models are still superior to machine learning when it comes to forecasting. In the article, “Statistical and Machine Learning forecasting methods: Concerns and ways forward”, the authors mention that,
“After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined.”
In many business environments, a data scientist is responsible for generating hundreds or thousands (possibly more) of forecasts for an entire company, as opposed to a single-series forecast. While it appears that Econometric methods are better at forecasting a single series (which I generally agree with), how do they compare when forecasting multiple series, which is likely the more common requirement in the real world? Some other things to consider when digesting the takeaways from that study:
In this article, I am going to show you an experiment I ran that compares machine learning models and Econometrics models for time series forecasting on an entire company’s set of stores and departments.
Before I kick this off, I have to mention that I’ve come across several articles describing how one can utilize ML for forecasting (typically with deep learning models), but I haven’t seen any that truly give ML the best chance at outperforming traditional Econometric models. On top of that, I also haven’t seen many legitimate attempts to showcase the best that Econometric models can do. That’s where this article and evaluation differ. The suite of functions I tested are near-fully optimized versions of both ML models and Econometric models (the list of models and tuning details are below). The functions come from the R open source package RemixAutoML, which is a suite of functions for automated machine learning (AutoML), automated forecasting, automated anomaly detection, automated recommender systems, automated feature engineering, and more. I provide the R script at the bottom of this article so you can replicate this experiment. You can also utilize the functions in Python via the rpy2 package and in Julia via the RCall package.
The data I’m utilizing comes from Kaggle — weekly Walmart sales by store and department. I’m only using the store and department combinations that have complete data to minimize the noise added to the experiment, which leaves me with a total of 2,660 individual store and department time series. Each store & dept combo has 143 records of weekly sales. I also removed the “IsHoliday” column that was provided.
Given the comments from the article linked above, I wanted to test out several forecast horizons. The performance for all models are compared on n-step ahead forecasts, for n = {1,5,10,20,30}, with distinct model builds used for each n-step forecast test. For each run, I have 2,660 evaluation time series for comparison, represented by each store and department combination. In the Results section you can find the individual results for each of those runs.
In the experiment I used the AutoTS() function for testing the Econometric models, and I used the RemixAutoML CARMA suite (Calendar-Auto-Regressive-Moving-Average) for testing Machine Learning. The AutoTS() function tests every model from the list below in several ways (similar to grid tuning in ML). The ML suite contains 4 different tree-based algorithms. As a side note, the Econometric models all come from the forecast package in R. You can see a detailed breakdown of how each model is optimized below the Results section in this article.
The table outputs below shows the ranks of 11 models (7 Econometric and 4 Machine Learning) when it comes to lowest mean absolute error (MAE) for every single store and department combination (2,660 individual time series) across five different forecast horizons.
For example, in the 1-step ahead forecast table below, NN was the most accurate model on 666 of the 2,660 time series. TBATS was the most accurate 414 times out of the 2,660.
Still looking at the 1-step ahead forecast table below, the NN was the second most accurate on 397 out of 2,660 time series. TBATS was the second most accurate on 406 out of the 2,660 time series. TBATS ranked last place (11th) 14 times.
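The ranking logic behind these tables is mechanical; here is a toy sketch with made-up MAE values for two series and three models (the real experiment uses 2,660 series and 11 models):

```r
library(data.table)

# Hypothetical holdout MAEs for two store/dept series
results <- data.table(
  Series = rep(c("Store1_Dept1", "Store1_Dept2"), each = 3),
  Model  = rep(c("NN", "TBATS", "ARIMA"), times = 2),
  MAE    = c(10, 12, 15, 9, 7, 20)
)

# Rank the models by MAE within each series, then count how often
# each model lands at each rank across series
results[, Rank := frank(MAE), by = Series]
rank_counts <- dcast(results, Model ~ Rank, value.var = "MAE", fun.aggregate = length)
```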
The histograms below were derived by selecting the best Econometrics model for each individual store and department time series (essentially an ensemble result) and the best Machine Learning model for each individual store and department time series (likewise an ensemble). You can see that as the forecast horizon grows, the Machine Learning models catch up to and slightly overtake the Econometrics models. At the shorter forecast horizons, the Econometrics models outperform the Machine Learning models by a larger margin than the converse.
While the short-term horizon forecasts are more accurate via the Econometrics models, I tend to have a greater need for longer-term forecasts for planning purposes, and the Machine Learning models exceed the Econometrics models in that category. On top of that, run time is a pretty significant factor for me.
If your business needs are the opposite, the Econometrics models are probably your best bet, assuming the run times are not a concern.
If I had enough resources available I’d run both functions and utilize the individual models that performed best for each series, which means I’d be utilizing all 11 models.
Each of the individual Econometrics models in AutoTS() is optimized based on the following treatments.
Global Optimizations (applies to all models):
A) Optimal Box-Cox Transformations are used in every run where data is strictly positive. The optimal transformation could be no transformation (artifact of Box-Cox).
B) Four different treatments are tested for each model:
The treatment of outlier smoothing and imputation sometimes has a beneficial effect on forecasts; sometimes it doesn’t. You really need to test both to see which generates more accurate out-of-sample predictions. The same goes for manually defining the frequency of the data: if you have daily data, you can specify “day” in the AutoTS arguments, or alternatively spectral analysis can be used to find the frequency of the data based on the dominant trend and seasonality. Sometimes this approach works better, sometimes it doesn’t. That’s why I test all the combinations for each model.
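Both global treatments can be illustrated with the forecast package directly; BoxCox.lambda() finds the optimal transformation parameter and findfrequency() performs spectral-analysis-based frequency detection. Whether AutoTS calls these exact helpers internally is an assumption; the snippet just demonstrates the two treatments on stand-in series:

```r
# Optimal Box-Cox lambda for a strictly positive series
# (a lambda near 1 means effectively no transformation)
lambda <- forecast::BoxCox.lambda(AirPassengers)
transformed <- forecast::BoxCox(AirPassengers, lambda)

# Spectral-analysis-based frequency detection on a simulated series
# with a dominant period of 7
set.seed(1)
x <- ts(sin(2 * pi * (1:280) / 7) + rnorm(280, sd = 0.1))
forecast::findfrequency(x)
```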
Individual Model Optimizations:
C) For the ARIMA and ARFIMA models, I used up to 25 lags and moving averages, with the number of differences and seasonal differences (up to one of each) determined algorithmically in the stepwise procedure (all combinations can be tested and run in parallel, but that’s too time consuming for my patience).
D) For the Double Seasonal Holt-Winters model, alpha, beta, gamma, omega, and phi are determined using least-squares and the forecasts are adjusted using an AR(1) model for the errors.
E) The Exponential Smoothing State-Space model runs through an automatic selection of the error type, trend type, and season type, with the options being “none”, “additive”, and “multiplicative”, along with testing of damped vs. non-damped trend (either additive or multiplicative). Alpha, beta, and phi are estimated.
F) The Neural Network is set up to test out every combination of lags and seasonal lags (25 lags, 1 seasonal lag) and the version with the best holdout score is selected.
G) The TBATS model utilizes 25 lags and moving averages for the errors, damped trend vs. non-damped trend are tested, trend vs. non-trend are also tested, and the model utilizes parallel processing.
H) The TSLM model utilizes simple time trend and season depending on the frequency of the data.
The CARMA suite utilizes several features to ensure proper models are built to generate the best possible out-of-sample forecasts.
A) Feature engineering: I use a time trend, calendar variables, holiday counts, and 25 lags and moving averages along with 51, 52, and 53-week lags and moving averages (all specified as arguments in the CARMA function suite). Internally, the CARMA functions utilize several RemixAutoML functions, all written using data.table for fast and memory efficient processing:
B) Optimal transformations: the target variable along with the associated lags and moving average features were transformed. This is really useful for regression models with categorical features that have associated target values that significantly differ from each other. The transformation options that are tested (using a Pearson test for normality) include:
The functions used to create the transformations throughout the process and then back-transform them after the forecasts have been generated come from RemixAutoML :
C) Models: there are four CARMA functions and each use a different algorithm for the model fitting. The models used to fit the time series data come from RemixAutoML and include:
You can view all of the 21 process steps in those functions on my GitHub page README under the section titled, “Supervised Learning Models” in the “Regression” sub-section (you can also view the source code directly of course).
D) GPU: With the CatBoost and XGBoost functions, you can build the models utilizing a GPU (I ran them with a GeForce 1080ti), which results in an average 10x speedup in model training time compared to running on a CPU with 8 threads. I should also note that the lags and moving average features by store and department are pretty intensive to compute and are built exclusively with data.table, which means that if you have a CPU with a lot of threads those calculations will be faster, as data.table is parallelized.
E) One model for all series: I built the forecasts for all the store and department combinations with a single model by simply specifying c(“Store”,”Dept”) in the GroupVariables argument, which provides superior results compared to building a separate model for each series. The group variables are used as categorical features and do not require one-hot encoding beforehand, as CatBoost and H2O handle that internally. The AutoXGBoostCARMA() version utilizes the DummifyDT() function from RemixAutoML to handle the categorical features.
F) The max number of trees used for each model was (early stopping is used internally):
G) Grid tuning: I ran a 6 model random hyper-parameter grid tune for each algorithm. Essentially, a baseline model is built and then 5 other models are built and compared with the lowest MAE model being selected. This is all done internally in the CARMA function suite.
H) Data partitioning: for creating the training, validation, and test data, the CARMA functions utilize the RemixAutoML::AutoDataPartition() function with the “timeseries” option for the PartitionType argument, which ensures that the train data reflects the data points furthest back in time, followed by the validation data, and then the test data, which is the most recent data points in time. For the experiment, I used 10/143 as the percent holdout for validation data. The test data varied by which n-step ahead holdout was being tested, and the remaining data went to the training set.
I) Forecasting: Once the regression model is built, the forecast process replicates an ARIMA process. First, a single step-ahead forecast is made. Next, the lags and moving average features are updated, making use of the predicted values from the previous step. Next, the other features are updated (trend, calendar, holiday). Then the next forecast step is made; rinse and repeat for remaining forecasting steps. This process utilizes the RemixAutoML functions:
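The recursive step-ahead loop can be illustrated with a deliberately tiny, self-contained example using a single lag and a plain lm(); the CARMA functions do the same dance with many lags, moving averages, and calendar features:

```r
set.seed(42)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 100))  # simulated AR(1) series

# Fit y_t on y_{t-1}: the one-lag analogue of the CARMA feature set
fit <- lm(y[-1] ~ y[-length(y)])

history <- y
n_ahead <- 5
preds <- numeric(n_ahead)
for (step in seq_len(n_ahead)) {
  last_val <- history[length(history)]                    # refresh the lag feature
  preds[step] <- coef(fit)[1] + coef(fit)[2] * last_val   # single step-ahead forecast
  history <- c(history, preds[step])                      # append prediction; repeat
}
```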
If anyone is interested in testing out other models, utilizing different data sets, or just need to set up automated forecasts for their company, contact me on LinkedIn.
If you’d like to learn how to utilize the RemixAutoML package check out the free course on Remyx Courses.
I have plans to continue enhancing and adding capabilities to the automated time series functions discussed above. For example, I plan to:
Code to reproduce: https://gist.github.com/AdrianAntico/8e1bbf63f26835756348d7c67a930227
library(RemixAutoML)
library(data.table)
###########################################
# Prepare data for AutoTS()----
###########################################
# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)
# Subset for Stores / Departments with Full Series Available: (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]
# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]
# Group Concatenation----
data[, GroupVar := do.call(paste, c(.SD, sep = " ")), .SDcols = c("Store","Dept")]
data[, c("Store","Dept") := NULL]
# Grab Unique List of GroupVar----
StoreDept <- unique(data[["GroupVar"]])
###########################################
# AutoTS() Builds----
###########################################
for(z in c(1,5,10,20,30)) {
TimerList <- list()
OutputList <- list()
l <- 0
for(i in StoreDept) {
l <- l + 1
temp <- data[GroupVar == eval(i)]
temp[, GroupVar := NULL]
TimerList[[i]] <- system.time(
OutputList[[i]] <- tryCatch({
RemixAutoML::AutoTS(
temp,
TargetName = "Weekly_Sales",
DateName = "Date",
FCPeriods = 1,
HoldOutPeriods = z,
EvaluationMetric = "MAPE",
TimeUnit = "week",
Lags = 25,
SLags = 1,
NumCores = 4,
SkipModels = NULL,
StepWise = TRUE,
TSClean = TRUE,
ModelFreq = TRUE,
PrintUpdates = FALSE)},
error = function(x) "Error in AutoTS run"))
print(l)
}
# Save Results When Done and Pull Them in After AutoCatBoostCARMA() Run----
save(TimerList, file = paste0(getwd(),"/TimerList_FC_",z,"_.R"))
save(OutputList, file = paste0(getwd(),"/OutputList_FC_",z,".R"))
rm(OutputList, TimerList)
}
###########################################
# Prepare data for AutoCatBoostCARMA()----
###########################################
# Load Walmart Data----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)
# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]
# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]
# Build AutoCatBoostCARMA Models----
for(z in c(1,5,10,20,30)) {
CatBoostResults <- RemixAutoML::AutoCatBoostCARMA(
data,
TargetColumnName = "Weekly_Sales",
DateColumnName = "Date",
GroupVariables = c("Store","Dept"),
FC_Periods = 10,
TimeUnit = "week",
TargetTransformation = TRUE,
Lags = c(1:25,51,52,53),
MA_Periods = c(1:25,51,52,53),
CalendarVariables = TRUE,
TimeTrendVariable = TRUE,
HolidayVariable = TRUE,
DataTruncate = FALSE,
SplitRatios = c(1 - (30+z)/143, 30/143, z/143),
TaskType = "GPU",
EvalMetric = "RMSE",
GridTune = FALSE,
GridEvalMetric = "r2",
ModelCount = 2,
NTrees = 1500,
PartitionType = "timeseries",
Timer = TRUE)
# Output----
CatBoostResults$TimeSeriesPlot
CatBoost_Results <- CatBoostResults$ModelInformation$EvaluationMetricsByGroup
data.table::fwrite(CatBoost_Results, paste0(getwd(),"/CatBoost_Results_",z,".csv"))
rm(CatBoost_Results,CatBoostResults)
}
###########################################
# Prepare data for AutoXGBoostCARMA()----
###########################################
# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)
# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]
# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]
for(z in c(1,5,10,20,30)) {
XGBoostResults <- RemixAutoML::AutoXGBoostCARMA(
data,
TargetColumnName = "Weekly_Sales",
DateColumnName = "Date",
GroupVariables = c("Store","Dept"),
FC_Periods = 2,
TimeUnit = "week",
TargetTransformation = TRUE,
Lags = c(1:25, 51, 52, 53),
MA_Periods = c(1:25, 51, 52, 53),
CalendarVariables = TRUE,
HolidayVariable = TRUE,
TimeTrendVariable = TRUE,
DataTruncate = FALSE,
SplitRatios = c(1 - (30+z)/143, 30/143, z/143),
TreeMethod = "hist",
EvalMetric = "MAE",
GridTune = FALSE,
GridEvalMetric = "mae",
ModelCount = 1,
NTrees = 5000,
PartitionType = "timeseries",
Timer = TRUE)
XGBoostResults$TimeSeriesPlot
XGBoost_Results <- XGBoostResults$ModelInformation$EvaluationMetricsByGroup
data.table::fwrite(XGBoost_Results, paste0(getwd(),"/XGBoost_Results",z,".csv"))
rm(XGBoost_Results, XGBoostResults)
}
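The Lags and MA_Periods arguments tell the CARMA functions which lagged values and moving averages to engineer as model features. As a rough illustration (not the package's internal implementation) of what a lag-1 and a trailing 4-week moving average look like on a toy weekly series:

```r
# Rough illustration of the lag and moving-average features that
# Lags / MA_Periods request, on a toy 8-week series:
y <- c(10, 12, 11, 13, 15, 14, 16, 18)
lag_k <- function(v, k) c(rep(NA, k), head(v, -k))
feat <- data.frame(
  y    = y,
  Lag1 = lag_k(y, 1),                              # last week's value
  MA4  = stats::filter(y, rep(1/4, 4), sides = 1)  # trailing 4-week mean
)
print(feat)
```

Early rows are NA because the window has not filled yet; DataTruncate controls whether such rows are dropped.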
###########################################
# Prepare data for AutoH2oDRFCARMA()----
###########################################
# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)
# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]
# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]
for(z in c(1,5,10,20,30)) {
H2oDRFResults <- RemixAutoML::AutoH2oDRFCARMA(
data,
TargetColumnName = "Weekly_Sales",
DateColumnName = "Date",
GroupVariables = c("Store","Dept"),
FC_Periods = 2,
TimeUnit = "week",
TargetTransformation = TRUE,
Lags = c(1:5, 51,52,53),
MA_Periods = c(1:5, 51,52,53),
CalendarVariables = TRUE,
HolidayVariable = TRUE,
TimeTrendVariable = TRUE,
DataTruncate = FALSE,
SplitRatios = c(1 - (30+z)/143, 30/143, z/143),
EvalMetric = "MAE",
GridTune = FALSE,
ModelCount = 1,
NTrees = 2000,
PartitionType = "timeseries",
MaxMem = "28G",
NThreads = 8,
Timer = TRUE)
# Plot aggregate sales forecast (Stores and Departments rolled up into Total)----
H2oDRFResults$TimeSeriesPlot
H2oDRF_Results <- H2oDRFResults$ModelInformation$EvaluationMetricsByGroup
data.table::fwrite(H2oDRF_Results, paste0(getwd(),"/H2oDRF_Results",z,".csv"))
rm(H2oDRF_Results, H2oDRFResults)
}
###########################################
# Prepare data for AutoH2OGBMCARMA()----
###########################################
# Load Walmart Data ----
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
data <- data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)
# Subset for Stores / Departments With Full Series (143 time points each)----
data <- data[, Counts := .N, by = c("Store","Dept")][Counts == 143][, Counts := NULL]
# Subset Columns (remove IsHoliday column)----
keep <- c("Store","Dept","Date","Weekly_Sales")
data <- data[, ..keep]
for(z in c(1,5,10,20,30)) {
H2oGBMResults <- RemixAutoML::AutoH2oGBMCARMA(
data,
TargetColumnName = "Weekly_Sales",
DateColumnName = "Date",
GroupVariables = c("Store","Dept"),
FC_Periods = 2,
TimeUnit = "week",
TargetTransformation = TRUE,
Lags = c(1:5, 51,52,53),
MA_Periods = c(1:5, 51,52,53),
CalendarVariables = TRUE,
HolidayVariable = TRUE,
TimeTrendVariable = TRUE,
DataTruncate = FALSE,
SplitRatios = c(1 - (30+z)/143, 30/143, z/143),
EvalMetric = "MAE",
GridTune = FALSE,
ModelCount = 1,
NTrees = 2000,
PartitionType = "timeseries",
MaxMem = "28G",
NThreads = 8,
Timer = TRUE)
# Plot aggregate sales forecast (Stores and Departments rolled up into Total)----
H2oGBMResults$TimeSeriesPlot
H2oGBM_Results <- H2oGBMResults$ModelInformation$EvaluationMetricsByGroup
data.table::fwrite(H2oGBM_Results, paste0(getwd(),"/H2oGBM_Results",z,".csv"))
rm(H2oGBM_Results, H2oGBMResults)
}
##################################################
# AutoTS() vs. ML CARMA Model Comparison----
##################################################
# Gather results----
for(i in c(1,5,10,20,30)) {
load(paste0("C:/Users/aantico/Desktop/Work/Remix/RemixAutoML/TimerList_",i,"_.R"))
load(paste0("C:/Users/aantico/Desktop/Work/Remix/RemixAutoML/OutputList_",i,"_.R"))
# Assemble TS Data
TimeList <- names(TimerList)
results <- list()
for(j in seq_along(TimeList)) {
results[[j]] <- cbind(
StoreDept = TimeList[j],
tryCatch({OutputList[[j]]$EvaluationMetrics[, .(ModelName,MAE)][
, ModelName := gsub("_.*","",ModelName)
][
, ID := 1:.N, by = "ModelName"
][
ID == 1
][
, ID := NULL
]},
error = function(x) return(
data.table::data.table(
ModelName = "NONE",
MAE = NA))))
}
# AutoTS() Results----
Results <- data.table::rbindlist(results)
# Remove ModelName == NONE
Results <- Results[ModelName != "NONE"]
# Average out values: one per store and dept so straight avg works----
Results <- Results[, .(MAE = mean(MAE, na.rm = TRUE)), by = c("StoreDept","ModelName")]
# Group Concatenation----
Results[, c("Store","Dept") := data.table::tstrsplit(StoreDept, " ")][, StoreDept := NULL]
data.table::setcolorder(Results, c(3,4,1,2))
##################################
# Machine Learning Results----
##################################
# Load up CatBoost Results----
CatBoost_Results <- data.table::fread(paste0(getwd(),"/CatBoost_Results_",i,".csv"))
CatBoost_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
data.table::setnames(CatBoost_Results, "MAE_Metric", "MAE")
CatBoost_Results[, ModelName := "CatBoost"]
data.table::setcolorder(CatBoost_Results, c(1,2,4,3))
# Load up XGBoost Results----
XGBoost_Results <- data.table::fread(paste0(getwd(),"/XGBoost_Results",i,".csv"))
XGBoost_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
data.table::setnames(XGBoost_Results, "MAE_Metric", "MAE")
XGBoost_Results[, ModelName := "XGBoost"]
data.table::setcolorder(XGBoost_Results, c(1,2,4,3))
# Load up H2oDRF Results----
H2oDRF_Results <- data.table::fread(paste0(getwd(),"/H2oDRF_Results",i,".csv"))
H2oDRF_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
data.table::setnames(H2oDRF_Results, "MAE_Metric", "MAE")
H2oDRF_Results[, ModelName := "H2oDRF"]
data.table::setcolorder(H2oDRF_Results, c(1,2,4,3))
# Load up H2oGBM Results----
H2oGBM_Results <- data.table::fread(paste0(getwd(),"/H2oGBM_Results",i,".csv"))
H2oGBM_Results[, ':=' (MAPE_Metric = NULL, MSE_Metric = NULL, R2_Metric = NULL)]
data.table::setnames(H2oGBM_Results, "MAE_Metric", "MAE")
H2oGBM_Results[, ModelName := "H2oGBM"]
data.table::setcolorder(H2oGBM_Results, c(1,2,4,3))
##################################
# Combine Data----
##################################
# Stack Files----
ModelDataEval <- data.table::rbindlist(
list(Results, CatBoost_Results, XGBoost_Results, H2oGBM_Results, H2oDRF_Results))
data.table::setorderv(ModelDataEval, cols = c("Store","Dept","MAE"))
# Add rank----
ModelDataEval[, Rank := 1:.N, by = c("Store","Dept")]
# Get Frequencies----
RankResults <- ModelDataEval[, .(Counts = .N), by = c("ModelName","Rank")]
data.table::setorderv(RankResults, c("Rank", "Counts"), order = c(1,-1))
# Final table----
FinalResultsTable <- data.table::dcast(RankResults, formula = ModelName ~ Rank, value.var = "Counts")
data.table::setorderv(FinalResultsTable, "1", -1, na.last = TRUE)
# Rename Columns----
for(k in 2:ncol(FinalResultsTable)) {
data.table::setnames(FinalResultsTable,
old = names(FinalResultsTable)[k],
new = paste0("Rank_",names(FinalResultsTable)[k]))
}
# Print
print(i)
print(knitr::kable(FinalResultsTable))
}
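The ranking step above (order models by MAE within each Store/Dept, assign Rank with 1 = best, then count how often each model achieves each rank) can be seen on a toy example. The MAE values here are made up, and base R is used for brevity where the post itself uses data.table:

```r
# Toy version of the ranking logic: order by MAE within each group,
# assign Rank (1 = best), then tabulate rank frequencies per model.
df <- data.frame(
  Group     = rep(c("S1_D1", "S1_D2", "S2_D1"), each = 2),
  ModelName = rep(c("CatBoost", "XGBoost"), times = 3),
  MAE       = c(10, 12, 8, 7, 5, 9)   # hypothetical holdout errors
)
df <- df[order(df$Group, df$MAE), ]
df$Rank <- ave(df$MAE, df$Group, FUN = seq_along)
rank_freq <- table(df$ModelName, df$Rank)  # analogous to the dcast() above
print(rank_freq)
```

Reading the table row-wise tells you, for each model, how many series it won outright versus placed second on, which is exactly what the final kable() output summarizes.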
The post Why Machine Learning is more Practical than Econometrics in the Real World first appeared on Remix Institute.
The post Build Thousands of Automated Demand Forecasts in 15 Minutes Using AutoCatBoostCARMA in R first appeared on Remix Institute.
The post Automate Your KPI Forecasts With Only 1 Line of R Code Using AutoTS first appeared on Remix Institute.
Automated forecasting automates every step of producing a forecast: wrangling and preparing your time series data, splitting it into training and holdout sets, training several different time series models, testing each model on the holdout set to measure its accuracy, then choosing the most accurate model and re-fitting it on the entire data set to create a forecast over a specified time horizon. Done by hand, this typically takes many steps and hundreds of lines of code; AutoTS does it in a single line.
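The train / holdout / compare / refit loop that AutoTS automates can be sketched in a few lines of base R. This is purely an illustration with two stand-in models from the stats package; AutoTS itself also automates the data preparation and tries a larger model list:

```r
# Illustrative only: the train / holdout / compare / refit pattern that
# AutoTS automates, hand-rolled with two stand-in models from base R.
y <- AirPassengers                      # monthly series that ships with R
h <- 12                                 # holdout length
train   <- window(y, end = c(1959, 12))
holdout <- window(y, start = c(1960, 1))
fits <- list(
  hw    = HoltWinters(train),
  arima = arima(train, order = c(1, 1, 1),
                seasonal = list(order = c(0, 1, 1), period = 12))
)
# Score each model on the holdout set by MAPE
mape <- sapply(fits, function(f) {
  fc <- predict(f, n.ahead = h)
  if (is.list(fc)) fc <- fc$pred        # predict.Arima returns a list
  mean(abs((holdout - fc) / holdout)) * 100
})
champion <- names(which.min(mape))      # re-fit this one on the full series
print(mape)
```

The winner would then be re-fit on all of `y` to produce the final forecast, which is the last step the opening paragraph describes.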
Typically, when companies create forecasts, they do so on a time series basis; that is, they generate daily, weekly, monthly, quarterly, or yearly forecasts.
Some examples of forecasting that we’ve seen at Fortune 500 companies and tech startups by industry are:
One of the challenges of enterprise forecasting is doing it in an automated, scalable, and unbiased way. Too often, business unit stakeholders build forecasts in complicated Excel spreadsheets, full of tabs, formulas, and ugly formatting, each using their own individual methodology and leaving no process for updating or reverse-engineering the work. When the employees who maintain those spreadsheets leave the company, enterprise use of the forecast stops, and the process has to be rebuilt from scratch.
So the typical process is neither automated (it requires specific personnel to update it manually), scalable (Excel doesn't scale, and the forecasts stop as soon as the employee leaves), nor unbiased (the employee followed their own methodology without giving anyone insight into it). Additionally, enterprise forecasts are often generated by non-quantitative personnel with no coding or statistical background, which leads to forecast errors.
AutoTS stands for automated time series, and it automatically finds and creates the most accurate forecast from a list of 7 econometric time series models including ARIMA, Holt-Winters, and Autoregressive Neural Networks.
It’s a function inside the RemixAutoML package in the open-source programming language R. R is a popular programming language for data scientists and analysts that is used to build statistical and machine learning models along with data visualizations.
The beauty of AutoTS and RemixAutoML is their simplicity and ease of use. Even if you’ve never programmed in R, you can still use AutoTS easily. If you’ve ever used a function inside Excel like sum() or if() formulas, then you can code using AutoTS.
The logo of AutoTS is a robot sniper, which symbolizes automation and accuracy.
AutoTS solves the automation problem because it eliminates manual updates of Excel forecast templates and eliminates relying on an employee’s methodology with no oversight. This methodology was likely created by someone with a non-quantitative background, but AutoTS uses best-in-class statistical and machine learning models. So you won’t have to worry about inaccurate forecasts.
AutoTS solves the scalability problem since it’s open source and code-based, and therefore, by its nature, reproducible. It can also be integrated into several popular BI platforms that have R integration, such as Tableau and PowerBI, as well as drag-and-drop analytics platforms like Alteryx.
AutoTS solves the bias problem since it doesn’t rely on human judgement, intuition, or manual intervention. That’s typically what creates error and bad decision-making in the first place. AutoTS is machine learning and statistically based.
AutoTS produces accurate forecasts by running your data through 7 different econometric time series models and choosing the one that predicts best out-of-sample. Out-of-sample refers to the holdout data set, and accuracy is defined as the lowest mean absolute percentage error (MAPE).
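As a concrete (hypothetical) example of the MAPE calculation on a three-point holdout:

```r
# Hypothetical three-period holdout to make the MAPE definition concrete
actual   <- c(100, 120, 90)   # observed values
forecast <- c(110, 115, 99)   # model's holdout predictions
mape <- mean(abs((actual - forecast) / actual)) * 100
print(round(mape, 2))         # about 8.06 percent
```

The model with the lowest such percentage across the holdout wins and becomes the champion model.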
The data set we’re using is weekly sales by Walmart store from Kaggle. The R code does some basic data wrangling to get total sales by week for the highest-grossing store, since the raw data set is by week, store, and department. If you have an internal company data set with a metric you want to forecast grouped by day, you can substitute it at Line 34, where “top_store_weekly_sales” is defined, and change the TimeUnit argument in AutoTS to “day”.
You can see how few lines of code are needed to create accurate, automated, scalable, and unbiased forecasts using machine learning. No more messy spreadsheets. Technically, AutoTS only uses 1 line of R code, but we gave each function argument its own line for tutorial presentation purposes.
We drew some inspiration for branding the forecast plot output with RemixAutoML based on Michael Toth’s blog here.
library(RemixAutoML)
library(data.table)
library(dplyr)
library(magrittr)
library(ggplot2)
library(scales)
library(magick)
library(grid)
# IMPORT DATA FROM REMIX INSTITUTE BOX ACCOUNT ----------
# link to manually download file: https://remixinstitute.app.box.com/v/walmart-store-sales-data/
walmart_store_sales_data = data.table::fread("https://remixinstitute.box.com/shared/static/9kzyttje3kd7l41y1e14to0akwl9vuje.csv", header = T, stringsAsFactors = FALSE)
# FIND TOP GROSSING STORE (USING dplyr) ---------------------
# group by Store, sum Weekly Sales
top_grossing_store = walmart_store_sales_data %>% dplyr::group_by(., Store) %>%
dplyr::summarize(., Weekly_Sales = sum(Weekly_Sales, na.rm = TRUE))
# max Sales of 45 stores
max_sales = max(top_grossing_store$Weekly_Sales)
# find top grossing store
top_grossing_store = top_grossing_store %>% dplyr::filter(., Weekly_Sales == max_sales)
top_grossing_store = top_grossing_store$Store %>% as.numeric(.)
# what is the top grossing store?
print(paste("Store Number: ", top_grossing_store, sep = ""))
# FIND WEEKLY SALES DATA FOR TOP GROSSING STORE (USING data.table) ----------
top_store_weekly_sales <- walmart_store_sales_data[Store == eval(top_grossing_store),
.(Weekly_Sales = sum(Weekly_Sales, na.rm = TRUE)),
by = "Date"]
# FORECAST WEEKLY SALES FOR WALMART STORE USING AutoTS ------
# forecast for the next 16 weeks - technically 1 line of code, but
# each argument was dedicated its own line for presentation purposes
weekly_forecast = RemixAutoML::AutoTS(
data = top_store_weekly_sales,
TargetName = "Weekly_Sales",
DateName = "Date",
FCPeriods = 16,
HoldOutPeriods = 12,
TimeUnit = "week"
)
# VISUALIZE AutoTS FORECASTS ----------------
# view 16 week forecast
View(weekly_forecast$Forecast)
# View model evaluation metrics
View(weekly_forecast$EvaluationMetrics)
# which model won?
print(weekly_forecast$ChampionModel)
# see ggplot of forecasts
plot = weekly_forecast$TimeSeriesPlot
# change y-axis labels to currency
plot = plot + ggplot2::scale_y_continuous(labels = scales::dollar)
# RemixAutoML branding. Inspiration here: https://michaeltoth.me/you-need-to-start-branding-your-graphs-heres-how-with-ggplot.html
logo = magick::image_read("https://www.remixinstitute.com/wp-content/uploads/7b-Cheetah_Charcoal_Inline_No_Sub_No_BG.png")
plot
grid::grid.raster(logo, x = .73, y = 0.01, just = c('left', 'bottom'), width = 0.25)
The post The Easiest Way to Create Thresholds And Improve Your Classification Model first appeared on Remix Institute.
]]>The post Companies Are Demanding Model Interpretability. Here’s How To Do It Right. first appeared on Remix Institute.
ICE and Partial Dependence Plots in LIME fail to tell me the accuracy surrounding the fitted relationship. Further, ICE doesn’t exactly tell me the probability of each of the lines occurring. Your model can overfit or underfit your data pretty easily, especially if you are using deep learning models. LIME (should be called LAME) fails to tell me how the model actually performs.
Imagine you are working on a price elasticity model that will guide pricing decisions. Currently you would show the relationship that the model was able to fit. Given that we will be using a model to guide pricing decisions, a sensible stakeholder might ask, “I see the relationship that your model fit, but how do I know that corresponds to the actual relationship?”
What do you do? Give the stakeholder some model accuracy metrics? Tell them that you used deep learning so they should just trust it because it is state-of-the-art technology?
Here is a simple solution to the shortfall of partial dependence plots: use calibration on your predicted relationship. It's that simple. Below is an example plot from the RemixAutoML package in R. The x-axis is the independent variable of interest, and the spacing between ticks is based on percentiles of its distribution. That means the data is uniformly distributed across the x-axis, so there is no need for the dashes shown above in the ICE chart. Second, we can see the relationship between the independent variable and the target variable, just as partial dependence plots show, but we can also see how well the model fits across the range of the independent variable. This addresses stakeholders' skepticism about the accuracy of your predictions. If you want to see the variability of your predictions, use the boxplot version too. If you want to see the relationship for a specific group, simply subset your data to that group of interest and rerun the function.
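Under the hood, a calibration-style partial dependence comparison boils down to bucketing the independent variable by percentiles and comparing the mean actual target to the mean prediction within each bucket. A minimal base-R sketch of that idea (illustrative only; ParDepCalPlots() does this, plus the plotting, for you):

```r
# Bucket the independent variable by percentiles, then compare mean actual
# vs. mean predicted target within each bucket.
set.seed(42)
x      <- runif(1000)
target <- x^2 + rnorm(1000, sd = 0.05)  # true relationship plus noise
pred   <- x^2                           # stand-in for model predictions
bucket <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.05)),
              include.lowest = TRUE)
d     <- data.frame(target, pred, bucket)
calib <- aggregate(cbind(target, pred) ~ bucket, data = d, FUN = mean)
print(head(calib, 3))  # mean actual vs. mean predicted per percentile bucket
```

Where the two columns track each other closely, the model is well calibrated in that region; where they diverge, the fitted relationship should not be trusted.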
#######################################################
# Create data to simulate validation data with predicted values
#######################################################
# Correl: This is the correlation used to determine how correlated the variables are to
# the target variable. Switch it up (between 0 and 1) to see how the charts below change.
Correl <- 0.85
data <- data.table::data.table(Target = runif(1000))
# Mock independent variables - they are correlated variables with
# various transformations so you can see different kinds of relationships
# in the charts below
# Helper columns for creating simulated variables
data[, x1 := qnorm(Target)]
data[, x2 := runif(1000)]
# Create one variable at a time
data[, Independent_Variable1 := log(pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable2 := (pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable3 := exp(pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable4 := exp(exp(pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2))))]
data[, Independent_Variable5 := sqrt(pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))]
data[, Independent_Variable6 := (pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))^0.10]
data[, Independent_Variable7 := (pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))^0.25]
data[, Independent_Variable8 := (pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))^0.75]
data[, Independent_Variable9 := (pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))^2]
data[, Independent_Variable10 := (pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))^4]
data[, Independent_Variable11 := ifelse(Independent_Variable2 < 0.20, "A",
ifelse(Independent_Variable2 < 0.40, "B",
ifelse(Independent_Variable2 < 0.6, "C",
ifelse(Independent_Variable2 < 0.8, "D", "E"))))]
# We’ll use this as a mock predicted value
data[, Predict := (pnorm(Correl * x1 +
sqrt(1-Correl^2) * qnorm(x2)))]
# Remove the helper columns
data[, ':=' (x1 = NULL, x2 = NULL)]
# In the ParDepCalPlot() function below, note the Function argument -
# we are using mean() to aggregate our values but you
# can use quantile(x, probs = y) for quantile regression
# Partial Dependence Calibration Plot:
p1 <- RemixAutoML::ParDepCalPlots(data,
PredictionColName = "Predict",
TargetColName = "Target",
IndepVar = "Independent_Variable1",
GraphType = "calibration",
PercentileBucket = 0.05,
FactLevels = 10,
Function = function(x) mean(x, na.rm = TRUE))
# Partial Dependence Calibration BoxPlot: note the GraphType argument
p2 <- RemixAutoML::ParDepCalPlots(data,
PredictionColName = "Predict",
TargetColName = "Target",
IndepVar = "Independent_Variable1",
GraphType = "boxplot",
PercentileBucket = 0.05,
FactLevels = 10,
Function = function(x) mean(x, na.rm = TRUE))
# Partial Dependence Calibration Plot:
p3 <- RemixAutoML::ParDepCalPlots(data,
PredictionColName = "Predict",
TargetColName = "Target",
IndepVar = "Independent_Variable4",
GraphType = "calibration",
PercentileBucket = 0.05,
FactLevels = 10,
Function = function(x) mean(x, na.rm = TRUE))
# Partial Dependence Calibration BoxPlot for factor variables:
p4 <- RemixAutoML::ParDepCalPlots(data,
PredictionColName = "Predict",
TargetColName = "Target",
IndepVar = "Independent_Variable11",
GraphType = "calibration",
PercentileBucket = 0.05,
FactLevels = 10,
Function = function(x) mean(x, na.rm = TRUE))
# Plot all the individual graphs in a single pane
RemixAutoML::multiplot(plotlist = list(p1,p2,p3,p4), cols = 2)
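One detail of the simulation above worth noting: the recurring expression `pnorm(Correl * x1 + sqrt(1-Correl^2) * qnorm(x2))` is a Gaussian-copula-style construction for manufacturing variables with a controlled correlation to the target. The underlying identity (a weighted sum of independent standard normals is standard normal with correlation r to the first) is easy to verify empirically:

```r
# If z1, z2 are independent standard normals, r*z1 + sqrt(1 - r^2)*z2 is
# standard normal with correlation r to z1. Quick check with r = 0.85:
set.seed(1)
r  <- 0.85
z1 <- rnorm(1e5)
z2 <- rnorm(1e5)
z3 <- r * z1 + sqrt(1 - r^2) * z2
print(round(cor(z1, z3), 3))  # should be very close to 0.85
```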
The post Why Data Scientist is the Sexiest Job in the Country first appeared on Remix Institute.