Introduction to Data Preparation in R
- Course Introduction
- Who Would Be Interested In This Course?
- Pre-Requisites
- Software Required
- Installing Necessary R Libraries
- What is Data Preparation or Data Wrangling?
- Business And Automation Concepts To Understand
- What Is The Business Problem That We’ll Solve In This Course?
- What Data Will We Be Using To Solve The Business Problem?
- What Are Data Frames?
- Different Data Types In A Data Frame
- Reading and Writing CSV Files As Data Frames
- Understanding The Columns And Data Types Of Your Data Frame
- Creating New Columns In A Data Frame
- Filtering A Data Frame
- Selecting Columns In A Data Frame
- Grouping and Summarizing A Data Frame in R
- Merging And Joining Data Frames in R
- Final Solution
- Full R Code Used In The Course
- Citation
Course Introduction
The purpose of this course is to teach you the 80/20 (Pareto Optimal) knowledge of data preparation, data wrangling, data cleaning, and data engineering tasks that are commonly used in data science and analytics projects.
That is, after taking this course, you’ll know the 20% of knowledge that will allow you to tackle 80% of tasks.
We will be teaching you two popular R packages for data preparation: dplyr and data.table.
It is commonly reported that data scientists and analysts spend the majority of their time on data preparation and data wrangling, which is why learning the skills in this course is important.
Who Would Be Interested In This Course?
If you meet any of these criteria below, then this course would be highly interesting for you:
- You want to learn how to automate common data wrangling and data preparation tasks so you can save a lot of time and become more efficient and productive.
- You want to learn how to migrate your non-reproducible Excel files or data preparation workflows to reproducible and automated workflows so you don’t have to reverse engineer your own or other people’s work.
- You want to get your feet wet learning a data science and AI programming language like R to stay ahead of current technology trends.
- You’re a non-technical manager and just want to understand common tasks involved in data science and analytics so you can become more data savvy.
- You’re not an analyst or data scientist but feel like learning how to automate data preparation tasks would be helpful for you in your career.
- You’re a Python or Julia data scientist who wants to evaluate the similarities and differences of common data wrangling tasks in R.
Pre-Requisites
- Basic Knowledge of Excel
- Helpful but not required: basic knowledge of SQL
Software Required
The only tools you’ll need are:
- R
- RStudio
- The dplyr, data.table, and magrittr libraries in R
R is an open source programming language that specializes in data science, data engineering, data analysis, statistical modeling, machine learning, and artificial intelligence (AI).
RStudio is an IDE for R. An IDE stands for an “integrated development environment”, which is a software application for editing and debugging code.
You can learn more about R and RStudio and how to install each in the Remyx Course titled How To Download and Install R.
Installing Necessary R Libraries
R packages (or libraries) are collections of R functions, data, and code wrapped in a usable format. Think of R packages as you would sets of Excel functions like SUM or VLOOKUP: they’re easy to use out-of-the-box, with no need for low-level programming. In fact, the R function sum() does exactly the same thing as the Excel function SUM().
When you first download R, it comes with a set of base packages, and thousands of others are available for download. There are currently over 10,000 packages (or libraries) available on CRAN, which is the global repository of R packages. Also note: In R, the words “library” and “package” are used interchangeably and mean the same thing.
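As a quick illustration of that parallel (a minimal sketch using only base R; the numbers are made up for the example), the built-in sum() function behaves just like Excel’s SUM:

```r
# base R's sum() works like Excel's SUM()
sum(c(19.79, 4.1119))   # returns 23.9019
```

Like Excel functions, it works immediately, with no installation required, because sum() ships in R’s base packages.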
To install the libraries needed for this course in R (dplyr, data.table, and magrittr), open RStudio, copy the following R code into the Code Editor, and click Run.
R Code for Installing and Loading R Packages
# install packages - only needs to be run once
install.packages("data.table")
install.packages("dplyr")
install.packages("magrittr")
# load packages
library(data.table)
library(dplyr)
library(magrittr)
What is Data Preparation or Data Wrangling?
Data Preparation is one of the 6 phases of the CRISP-DM Data Science Process Model. CRISP-DM stands for “cross-industry standard process for data mining” and was developed by a consortium of companies including SPSS (now part of IBM), Teradata, Daimler-Benz (now Mercedes-Benz Group), NCR Corporation, and OHRA.
Data Preparation (or Data Wrangling) involves everything related to data cleaning, data QA, imputations, handling of missing data, data filtering and subsets, groupings, summaries, joins, derived columns, and many other steps. It’s been said that 80% of a data scientist and data analyst’s time is spent on the Data Preparation phase.
You can learn more about CRISP-DM in the Remyx Course titled CRISP-DM Data Science Process Model.
Business And Automation Concepts To Understand
There are three key business and automation concepts to understand that highlight the importance of learning Data Preparation in a programming language like R.
- Exploratory Data Analysis (EDA) – Since the Data Preparation phase is one of the key phases in CRISP-DM, without doing proper data preparation, you will be unable to perform Exploratory Data Analysis (EDA). EDA basically means analyzing and exploring the data set to formulate hypotheses, inform statistical and machine learning model building, provide decision support, and suggest new data collection methods and experiments. EDA has been promoted by statistician John Tukey since the 1970s to encourage data scientists, statisticians, and data analysts to explore the data first before moving on to later stages.
- Reproducibility – Automation requires that a process be repeatable and reproducible. Doing data preparation in Excel is fine, but oftentimes, the process for preparing that data in Excel is manual, and the steps are irreproducible except by the original author. Doing data preparation in a programming language like R helps automate the process and shows any user who looks at the code exactly which steps were taken to prepare the data.
- Speed-To-Market – Having automation also means that you gain a speed-to-market advantage. That is, you can prepare data faster and easier since the code is already automated which allows you to throughput more models, analytics, and products faster than your competitors.
What Is The Business Problem That We’ll Solve In This Course?
Objective: Using R’s data preparation capabilities, we want to find the top 10 items ordered at a UK-based online retail company by customers in Ireland.
The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.
What Data Will We Be Using To Solve The Business Problem?
Before you can even begin the Data Preparation phase of CRISP-DM, you need to understand what data you have and what data you don’t have but need. This is called the Data Understanding phase of the CRISP-DM process model.
The dataset we will be using for this course comes from the University of California Irvine (UCI) Machine Learning Repository, at https://archive.ics.uci.edu/ml/datasets/online+retail. The data contains transactions occurring between December 1, 2010 and December 9, 2011 for a UK-based and registered non-store online retail company.
We like this dataset because it looks similar to datasets you’d encounter in real-world business settings. Typically, you’d be able to access transaction datasets like this one from your company’s data warehouse.
What Are Data Frames?
Think of Data Frames much like you would a spreadsheet in Excel or a table in an SQL database. A Data Frame is a data structure which organizes data into a 2-dimensional table of rows and columns.
A Data Frame in R is exactly what you’d expect an Excel spreadsheet to look like: data displayed in table format with rows and columns.
Different Data Types In A Data Frame
Values inside the columns in a Data Frame are stored as different types. The most common types of data stored in a Data Frame in R are:
- Numeric
- Integer
- Character (aka String)
- Logical (aka Boolean)
A table that describes each data type and an example of that data type is below:
| Data Type | Description | Examples |
| --- | --- | --- |
| Numeric | The value inside the column is stored as a number with decimal places | 19.79, 4.1119 |
| Integer | The value inside the column is stored as a number without decimal places | -100, -2, -1, 0, 1, 2, 100 |
| Character | The value inside the column is stored as text or a string | “Remyx Courses”, “artificial intelligence” |
| Logical | The value inside the column is stored as a Boolean value (i.e. True or False) | TRUE, FALSE |
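To see these types in action, here is a small self-contained example (the column names and values are invented for illustration, not taken from the course dataset) that builds a toy data frame and inspects its types with str():

```r
# a toy data frame with one column of each common data type
example_df = data.frame(
  Price   = c(19.79, 4.1119),      # Numeric
  Units   = c(3L, 100L),           # Integer
  Product = c("gift", "candle"),   # Character
  InStock = c(TRUE, FALSE),        # Logical
  stringsAsFactors = FALSE
)

# str() prints each column's name and data type
str(example_df)
```

Running str() on any data frame is a quick way to confirm which of these four types each column was stored as.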
Reading and Writing CSV Files As Data Frames
One common way of getting data you need is by reading in data that’s stored in tabular format in a CSV file (which stands for Comma Separated Values). A comma-separated values (CSV) file is a text file that uses a comma to separate values. Each line (or row) of the file is a data record. Each record consists of one or more columns, separated by commas. A CSV file looks like a spreadsheet when you open it and ends with the .csv filename extension.
We’ve uploaded the dataset from UCI Machine Learning Repository to Dropbox specifically for this course so it can be easily read directly into R as a CSV. This eliminates the need for a manual download of the Excel file to your local directory, saving it as .CSV, and then writing code in R to read from that specific directory. We’ve basically made the process easy for you.
Reading In/Importing CSV Files
To read in the Online Retail CSV dataset as a Data Frame in R, you’d use the fread function from the data.table package in R. The name fread stands for “fast read”, and it’s Remix Institute’s only recommended way of reading in CSV files. The code for doing that is below:
R Code for Reading In CSVs
# Read In/Import into R the Online Retail CSV dataset from Remix Institute's Dropbox
online_retail_data = data.table::fread("https://www.dropbox.com/s/ygecmz70oy5ch9i/Online%20Retail.csv?dl=1", header = TRUE, stringsAsFactors = FALSE)
The string inside the quotation marks is the location of the CSV file you want to read in. The header = TRUE argument means that the first row of the CSV file contains the column (header) names.
Writing/Exporting CSV Files
To write out a Data Frame in R as a CSV file to a local directory, you’d use the fwrite function from the data.table package in R. The name fwrite stands for “fast write”, and it’s Remix Institute’s only recommended way of writing out CSV files. The code for doing that is below:
R Code for Writing to CSV
# Write Out/Export an R Data Frame as CSV to your local directory
# 1. Replace online_retail_data with the name of the dataframe you want to export
# 2. Replace the string inside the quotations to the location of the directory and filename you want to export to
data.table::fwrite(online_retail_data, "C:/Users/RemixLearner/Documents/online_retail_data.csv")
Understanding The Columns And Data Types Of Your Data Frame
According to the University of California Irvine Machine Learning Repository’s website, the column and data type information for the dataset is in the table below. If you want to see a brief summary of the column names and data types in R, you can run str(online_retail_data) in R.
| Column | Description | Data Type |
| --- | --- | --- |
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation. | Character |
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. | Character |
| Description | Product (item) name. Nominal. | Character |
| Quantity | The quantities of each product (item) per transaction. Numeric. | Integer |
| InvoiceDate | Invoice date and time. The day and time when each transaction was generated. | Character |
| UnitPrice | Unit price. Numeric, product price per unit in sterling. | Numeric |
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. | Integer |
| Country | Country name. Nominal, the name of the country where each customer resides. | Character |
Creating New Columns In A Data Frame
You’ll notice, based on UCI’s documentation, that any invoice numbers that start with ‘c’ are cancellations. We’re going to create a new column in the Data Frame called CancelledInvoiceFlag which indicates if the invoice was cancelled. We give it a value of 1 if the invoice is cancelled and a value of 0 if it’s not cancelled.
In the R code below, we show two ways to create new columns in R (see the R code comments). You can use whichever way you prefer; both perform the exact same operation. Notice the substring function, which extracts the characters from position 1 through position 1 — that is, the first letter — of the InvoiceNo. Also notice the ifelse function, which acts exactly like the IF-THEN formula in Excel.
To create new columns in R, the syntax can be either dataframe$column_name or dataframe[ , "column_name"], where dataframe is the name of your data frame and column_name is the name of the column you want to create. (Note: fread returns a data.table, so when reading a column’s values inside an expression, use dataframe$column_name or dataframe[["column_name"]] to extract the column as a vector.)
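Before applying this to the full dataset, here is a small warm-up example (the invoice numbers are made up for illustration) showing how substring and ifelse combine to build a flag:

```r
# toy invoice numbers - a leading "C" marks a cancellation
invoice_numbers = c("536365", "C536379", "536380")

# extract the first character of each invoice number
first_letter = substring(invoice_numbers, 1, 1)   # "5" "C" "5"

# flag cancellations with 1, everything else with 0
ifelse(first_letter == "C", 1, 0)                 # returns 0 1 0
```

The course code below does exactly this, only on the InvoiceNo column of the data frame.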
R Code for Creating a New Column called “CancelledInvoiceFlag”
# First Way - create a flag for cancelled invoices
# (fread returns a data.table, so use [[ ]] to extract the column as a vector)
online_retail_data[ , "CancelledInvoiceFlag"] = ifelse(substring(online_retail_data[["InvoiceNo"]], 1, 1) == 'C', 1, 0)
# or Second Way - create a flag for cancelled invoices
online_retail_data$CancelledInvoiceFlag = ifelse(substring(online_retail_data$InvoiceNo, 1, 1) == 'C', 1, 0)
# view the dataset to see the new column
View(online_retail_data)
We’ll also create another column called NegativeQuantityFlag which indicates if the value in the Quantity column is negative. We give it a value of 1 if the Quantity column is negative and a value of 0 if the Quantity column is not negative. Like we did for “CancelledInvoiceFlag”, the R code below shows two ways for creating the new column in R (see R code comments). You can use either way you prefer. Both ways perform the exact same operation.
R Code for Creating a New Column called “NegativeQuantityFlag”
# First Way - create a flag for negative quantities
online_retail_data[, "NegativeQuantityFlag"] = ifelse(online_retail_data[["Quantity"]] < 0, 1, 0)
# or Second Way - create a flag for negative quantities
online_retail_data$NegativeQuantityFlag = ifelse(online_retail_data$Quantity < 0, 1, 0)
# view the dataset to see the new column
View(online_retail_data)
Filtering A Data Frame
Filtering a data frame is the process of taking a subset or smaller part of the full dataset based on certain conditions that you specify. The conditions are applied to the columns of the data frame. You’ve probably worked with filters in Excel and SQL so you’d be familiar with the concept.
Based on the new columns we created in the “Creating New Columns In A Data Frame” section, we’re going to filter our data frame based on specific conditions in those columns.
In real-world business scenarios, it’s often important to remove cancellations or refunds from your analysis and model. You don’t want to attribute sales to a transaction if the customer cancelled the order or returned the product for a refund. For our first filter, we will be removing cancellations from the dataset using the newly created “CancelledInvoiceFlag.”
Also, if you study the dataset, you’ll notice that there are some non-cancelled invoices where Quantity is a negative number. Many times, these non-cancelled invoices with a negative Quantity number have a StockCode but no Description or CustomerID. These look like bad data points and should be removed from the dataset before analysis or modeling. For our second filter, we will be removing records with a negative Quantity from the dataset using the newly created “NegativeQuantityFlag.”
These two filters will remove 10,624 rows from the data frame.
We’ll show you two ways to filter a data frame in R: the first uses the data.table package, and the second uses the dplyr package. You can use whichever you prefer; both perform the exact same operation. However, the data.table package can perform these operations much faster than the dplyr package. So if you’re optimizing for speed and performance, use the data.table package. If you’re optimizing for code readability, use the dplyr package.
Note: In SQL and Excel, the logical notation for “not equal to” is <>, but the logical notation for “not equal to” in R is !=. Also note the logical operator &, which means “and” and is similar to the “and” operator in SQL and Excel.
R Code for Filtering a Data Frame
# how many rows in the data frame?
print(paste0("Number of rows in data frame: ", nrow(online_retail_data)))
# First Way - filter a data frame using the subset() function from base R and data.table package.
# Here we want to remove cancelled invoices and negative quantities. Notice the != means "not equal to".
online_retail_data1 = subset(online_retail_data, CancelledInvoiceFlag != 1 & NegativeQuantityFlag != 1)
# Second Way - filter a data frame using the filter() function from the dplyr package.
# Here we want to remove cancelled invoices and negative quantities. Notice the != means "not equal to".
online_retail_data2 = online_retail_data %>% dplyr::filter(CancelledInvoiceFlag != 1 & NegativeQuantityFlag != 1)
# how many rows in the data frame after the filter?
print(paste0("Number of rows in filtered data frame: ", nrow(online_retail_data1)))
Selecting Columns In A Data Frame
Selecting columns in a data frame is exactly what it sounds like: you choose which columns in the data frame to keep and which ones to remove, similar to a SQL select clause.

There are 3,941 unique StockCodes (or product item codes) in the Online Retail dataset after applying the two filters from the “Filtering A Data Frame” section. You can check this by running length(unique(online_retail_data1$StockCode)) in R.
For the Business Problem addressed earlier in the course, we want to see the top 10 most popular items by Country. To do this, we select only the columns of interest: StockCode, Description, Quantity, and Country.
We’ll show you two ways to select columns in a data frame in R: the first uses the data.table package, and the second uses the dplyr package. You can use whichever you prefer; both perform the exact same operation. However, the data.table package can perform these operations much faster than the dplyr package. So if you’re optimizing for speed and performance, use the data.table package. If you’re optimizing for code readability, use the dplyr package.
R Code for Selecting Columns in a Data Frame
# First Way - select columns in a data frame using the data.table package
online_retail_data1 = online_retail_data1[, .(StockCode, Description, Quantity, Country)]
# Second Way - select columns in a data frame using the dplyr package
online_retail_data2 = online_retail_data2 %>% dplyr::select(StockCode, Description, Quantity, Country)
Grouping and Summarizing A Data Frame in R
Grouping in R is similar to SQL group by
clauses and aggregating rows in Excel PivotTables. Grouping allows you to aggregate rows based on unique values of one or more columns. It returns one row for each group.
Just like SQL group by
clauses, grouping is often used with aggregation functions such as sum, count, min, max, and average to summarize the results of each group.
For example, when we want to find the top 10 most popular items by Country, we group by Country and StockCode and find the sum of the order Quantity for each group.
The next block of R code will do a sum() of Quantity (and call it TotalQuantity) and group by Country, StockCode, and Description.
Grouping and summarizing can be done two ways in R: the first uses the data.table package, and the second uses the dplyr package. You can use whichever you prefer; both perform the exact same operation. However, the data.table package can perform these operations much faster than the dplyr package. So if you’re optimizing for speed and performance, use the data.table package. If you’re optimizing for code readability, use the dplyr package.
R Code for Grouping and Summarizing a Data Frame
# First Way - group by Country, StockCode, Description and sum up Quantity using the data.table package
online_retail_data1_summary = online_retail_data1[, .(TotalQuantity = sum(Quantity)), by = .(Country, StockCode, Description)]
# Second Way - group by Country, StockCode, Description and sum up Quantity using the dplyr package
online_retail_data2_summary = online_retail_data2 %>%
dplyr::group_by(Country, StockCode, Description) %>%
dplyr::summarize(TotalQuantity = sum(Quantity))
Merging And Joining Data Frames in R
A join (or merge) in R is the process of combining rows from two or more data frames (or tables) based on a common column between them. It’s the same as a SQL join, and it’s similar to a VLOOKUP in Excel.
We have a table of Alpha 2 and Alpha 3 Country Codes based on ISO 3166 international standards which we want to join (or merge) to our summary data frames based on the Country column.
We would use the merge() function in R to do this. In this case, we will be doing a Left Join, meaning we want to keep all the rows in the first data frame and join the Alpha 2 and Alpha 3 country codes wherever a matching Country name is found. For a more comprehensive understanding of joins, you can take the Remyx Courses course on SQL joins.
The data frame we’ll be joining is a table that has 3 columns: Country, Alpha 2 code, and Alpha 3 code. They are all character data types. These Alpha codes are used throughout the IT industry by software and computer systems to ease the identification of country names.
Notice the all.x = TRUE argument in the merge function, which means it’s doing a LEFT JOIN.
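If you’d like to see the left-join behavior in isolation first, here is a toy example (the table contents are invented for illustration; “Atlantis” deliberately has no match) showing that unmatched rows from the left table are kept, with NA in the joined columns:

```r
# toy left join with merge(): all rows of the left table are kept
orders = data.frame(Country = c("Ireland", "Atlantis"),
                    TotalQuantity = c(100, 5))
codes  = data.frame(Country = c("Ireland", "France"),
                    Alpha2  = c("IE", "FR"))

# all.x = TRUE keeps every row of orders;
# the Atlantis row survives with Alpha2 = NA
merge(orders, codes, by = "Country", all.x = TRUE)
```

The course code below does the same thing, joining the country-code table to the summary data frame on the Country column.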
R Code for Merging/Joining Two Data Frames
# Read in CSV from Remix Institute Dropbox
# Original Source: https://www.iban.com/country-codes
iso_3166_country_codes = data.table::fread("https://www.dropbox.com/s/5g6z1zpa560qwf6/Online%20Retail%20Data%20-%20ISO%203166%20Country%20Codes%20Alpha%202%20and%20Alpha%203.csv?dl=1", header = TRUE, stringsAsFactors = FALSE)
# Merge and do a Left Join of online_retail_data1_summary and iso_3166_country_codes on the Country column
online_retail_data_summary = merge(online_retail_data1_summary, iso_3166_country_codes, all.x = TRUE, by = "Country")
Final Solution
You learned the 80/20 (Pareto Optimal) knowledge of data preparation steps using R.
You learned how to read and import CSVs, how to work with Data Frames, how to understand the data, how to create new columns, how to filter, how to group and summarize, and how to do merges and joins, all inside of R.
Now we have all the data cleaned and prepared to answer our Business Problem addressed earlier in the course: Find the top 10 items ordered at a UK-based online retail company by customers in Ireland.
To do this, you run 3 more steps:
- Sort TotalQuantity from highest to lowest using the order function in R.
- Filter for country code “IE”, which stands for Ireland.
- Take just the top 10 rows in the data frame using the head function in R.
The R code to do this is below. If you did everything correctly, you’ll see that the top item ordered in Ireland was the “PACK OF 72 RETROSPOT CAKE CASES.”
Go Raibh Maith Agat! Slán! (Irish for “Thank you! Goodbye!”)
R Code To Generate Final Solution
# sort TotalQuantity from highest to lowest
online_retail_data_summary = online_retail_data_summary[order(TotalQuantity, decreasing = TRUE), ]
# filter for Ireland (IE) using Alpha2Code_ISO_3166 column
online_retail_data_ireland_summary = subset(online_retail_data_summary, Alpha2Code_ISO_3166 == 'IE')
# filter for top 10 rows
online_retail_data_ireland_summary = head(online_retail_data_ireland_summary, n = 10)
# view the Final Solution
View(online_retail_data_ireland_summary)
Notice that decreasing is set to TRUE meaning you’re sorting by highest to lowest. If decreasing was set to FALSE, it would mean sorted by lowest to highest.
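As a tiny standalone illustration of how order() works (the values are invented for the example):

```r
quantities = c(30, 10, 20)

# order() returns the positions that would sort the vector
order(quantities, decreasing = TRUE)               # returns 1 3 2

# indexing with that result gives the sorted values
quantities[order(quantities, decreasing = TRUE)]   # returns 30 20 10
```

Inside a data.table, the same call sorts whole rows: DT[order(TotalQuantity, decreasing = TRUE), ] reorders the rows by that column.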
Full R Code Used In The Course
# Install and Load R Libraries ----
# install packages - only needs to be run once
install.packages("data.table")
install.packages("dplyr")
install.packages("magrittr")
# load packages
library(data.table)
library(dplyr)
library(magrittr)
# Read and Write CSV Files --------
# Read in CSV from Remix Institute Dropbox
# Original Source: UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets/online+retail
online_retail_data = data.table::fread("https://www.dropbox.com/s/ygecmz70oy5ch9i/Online%20Retail.csv?dl=1", header = TRUE, stringsAsFactors = FALSE)
# view the dataset
View(online_retail_data)
# Write Out/Export an R Data Frame as CSV to your local directory
# 1. Replace online_retail_data with the name of the dataframe you want to export
# 2. Replace the string inside the quotations to the location of the directory and filename you want to export to
data.table::fwrite(online_retail_data, "C:/Users/RemixLearner/Documents/online_retail_data.csv")
# Create New Columns -------
# First Way - create a flag for cancelled invoices
# (fread returns a data.table, so use [[ ]] to extract the column as a vector)
online_retail_data[ , "CancelledInvoiceFlag"] = ifelse(substring(online_retail_data[["InvoiceNo"]], 1, 1) == 'C', 1, 0)
# or Second Way - create a flag for cancelled invoices
online_retail_data$CancelledInvoiceFlag = ifelse(substring(online_retail_data$InvoiceNo, 1, 1) == 'C', 1, 0)
# view the dataset to see the new column
View(online_retail_data)
# First Way - create a flag for negative quantities
online_retail_data[, "NegativeQuantityFlag"] = ifelse(online_retail_data[["Quantity"]] < 0, 1, 0)
# or Second Way - create a flag for negative quantities
online_retail_data$NegativeQuantityFlag = ifelse(online_retail_data$Quantity < 0, 1, 0)
# view the dataset to see the new column
View(online_retail_data)
# Filter A Data Frame ---------
# how many rows in the data frame?
print(paste0("Number of rows in data frame: ", nrow(online_retail_data)))
# First Way - filter a data frame using the subset() function from base R and data.table package.
# Here we want to remove cancelled invoices and negative quantities. Notice the != means "not equal to".
online_retail_data1 = subset(online_retail_data, CancelledInvoiceFlag != 1 & NegativeQuantityFlag != 1)
# Second Way - filter a data frame using the filter() function from the dplyr package.
# Here we want to remove cancelled invoices and negative quantities. Notice the != means "not equal to".
online_retail_data2 = online_retail_data %>% dplyr::filter(CancelledInvoiceFlag != 1 & NegativeQuantityFlag != 1)
# how many rows in the data frame after the filter?
print(paste0("Number of rows in filtered data frame: ", nrow(online_retail_data1)))
# Selecting Columns In A Data Frame --------
# First Way - select columns in a data frame using the data.table package
online_retail_data1 = online_retail_data1[, .(StockCode, Description, Quantity, Country)]
# Second Way - select columns in a data frame using the dplyr package
online_retail_data2 = online_retail_data2 %>% dplyr::select(StockCode, Description, Quantity, Country)
# Grouping and Summarizing a Data Frame --------
# First Way - group by Country, StockCode, Description and sum up Quantity using the data.table package
online_retail_data1_summary = online_retail_data1[, .(TotalQuantity = sum(Quantity)), by = .(Country, StockCode, Description)]
# Second Way - group by Country, StockCode, Description and sum up Quantity using the dplyr package
online_retail_data2_summary = online_retail_data2 %>%
dplyr::group_by(Country, StockCode, Description) %>%
dplyr::summarize(TotalQuantity = sum(Quantity))
# Merging and Joining a Data Frame ---------
# Read in CSV from Remix Institute Dropbox
# Original Source: https://www.iban.com/country-codes
iso_3166_country_codes = data.table::fread("https://www.dropbox.com/s/5g6z1zpa560qwf6/Online%20Retail%20Data%20-%20ISO%203166%20Country%20Codes%20Alpha%202%20and%20Alpha%203.csv?dl=1", header = TRUE, stringsAsFactors = FALSE)
# Merge and do a Left Join of online_retail_data1_summary and iso_3166_country_codes on the Country column
online_retail_data_summary = merge(online_retail_data1_summary, iso_3166_country_codes, all.x = TRUE, by = "Country")
# Final Solution ------------
# sort TotalQuantity from highest to lowest
online_retail_data_summary = online_retail_data_summary[order(TotalQuantity, decreasing = TRUE), ]
# filter for Ireland (IE) using Alpha2Code_ISO_3166 column
online_retail_data_ireland_summary = subset(online_retail_data_summary, Alpha2Code_ISO_3166 == 'IE')
# filter for top 10 rows
online_retail_data_ireland_summary = head(online_retail_data_ireland_summary, n = 10)
# view the Final Solution
View(online_retail_data_ireland_summary)
Citation
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).