Introduction to Data Preparation in Julia

Kuizu


Course Introduction

The purpose of this course is to teach you the 80/20 (Pareto principle) knowledge of data preparation, data wrangling, data cleaning, and data engineering: the small set of tasks that covers most of what data science and analytics projects need.

We will be teaching you two popular packages in Julia for data preparation: DataFrames.jl and CSV.jl.

It is commonly reported that data scientists and analysts spend the majority of their time on data preparation and data wrangling.



Who Would Be Interested In This Course?

If you meet any of these criteria below, then this course would be highly interesting for you:

  • You want to learn how to automate common data wrangling and data preparation tasks so you can save a lot of time and become more efficient and productive.
  • You want to learn how to migrate your non-reproducible Excel files or data preparation workflows to reproducible and automated workflows so you don’t have to reverse engineer other people’s work.
  • You want to get your feet wet in learning a data science and AI programming language like Julia to stay ahead of current technology trends.
  • You’re a non-technical manager and just want to understand common tasks involved in data science and analytics so you can become more data savvy.
  • You’re not an analyst or data scientist but feel like learning how to automate data preparation tasks would be helpful for you in your career.


Pre-Requisites

  • Intermediate Knowledge of Excel
  • Helpful but not required: basic knowledge of SQL


Software Required

The only three tools you’ll need are:

  • Julia
  • Pluto
  • DataFrames and CSV libraries in Julia

Julia is an open-source programming language well suited to data science, data engineering, data analysis, statistical modeling, machine learning, and artificial intelligence (AI).

Pluto is a reactive notebook editor for Julia.

You can learn more about Julia and Pluto and how to install each in the Remyx Course titled How To Download and Install R.



Installing Necessary R Libraries

R packages (or libraries) are collections of R functions, data, and code wrapped in a usable format. Think of them as sets of Excel functions like sum() or vlookup(): they’re easy to use out of the box, with no need for low-level programming. In fact, the R function sum() does the same thing as the Excel function sum().

When you first download R, it comes with a set of base packages, and thousands of others are available for download. There are currently over 10,000 packages (or libraries) available on CRAN, which is the global repository of R packages.

To install the libraries needed for this course in R (dplyr and data.table), open RStudio, copy the following R code into the Code Editor, and click Run.

R Code for Installing R Packages

# Install the dplyr and data.table libraries in R

install.packages("dplyr")
install.packages("data.table")
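Since the Software Required list names the DataFrames and CSV libraries in Julia, a Julia counterpart to the R commands above may be useful. This is a minimal sketch that assumes a working Julia installation with network access. Recent versions of Pluto can also install packages automatically when a cell contains a `using` statement, so this step may be unnecessary inside a notebook.

```julia
# Install the DataFrames and CSV packages with Julia's built-in package manager.
using Pkg

Pkg.add(["DataFrames", "CSV"])

# Load both packages to confirm the installation worked.
using DataFrames, CSV
```

Installation is a one-time step per environment; later sessions only need the `using DataFrames, CSV` line.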


What is Data Preparation or Data Wrangling?

Business and Automation Concepts To Understand

  • CRISP-DM
  • Reproducibility
  • Speed-to-market

What are Data Frames?

Think of Data Frames much like you would a spreadsheet in Excel or a table in an SQL database. A Data Frame is a data structure which organizes data into a 2-dimensional table of rows and columns.

A Data Frame in Julia looks the way you would expect an Excel spreadsheet to look: data displayed in a table with rows and columns.
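To make the spreadsheet comparison concrete, here is a minimal sketch that builds a small Data Frame with DataFrames.jl and looks up rows, columns, and single cells. The table name `orders` and its values are hypothetical sample data, not taken from the course dataset.

```julia
using DataFrames

# Hypothetical sample rows, loosely modeled on a retail transaction table.
orders = DataFrame(
    InvoiceNo = ["536365", "536366", "C536379"],
    Quantity  = [6, 2, -1],
    UnitPrice = [2.55, 1.85, 27.50],
)

# A Data Frame is a 2-dimensional grid of rows and columns, like a spreadsheet.
println(size(orders))           # (number of rows, number of columns)
println(names(orders))          # column names, like spreadsheet headers
println(orders[1, :Quantity])   # a single cell: row 1 of the Quantity column
println(orders.UnitPrice)       # an entire column as a vector
```

Each keyword argument to `DataFrame` becomes one column, and all columns must have the same number of rows.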



Different Data Types In A Data Frame

Values inside the columns in a Data Frame are stored as different types. The most common types of data stored in a Data Frame in Julia are:

  • Numeric
  • Integer
  • Character (aka String)
  • Logical (aka Boolean)

A table that describes each data type and an example of that data type is below:

Data Type | Description | Examples
Numeric | The value inside the column is stored as a number with decimal places | 19.79, 4.1119
Integer | The value inside the column is stored as a number without decimal places | -100, -2, -1, 0, 1, 2, 100
Character | The value inside the column is stored as text (a string) | “Remyx Courses”, “artificial intelligence”
Logical | The value inside the column is stored as a Boolean value (i.e. true or false) | true, false
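In Julia itself, these four categories usually appear under different names: on a typical 64-bit installation, numbers with decimals are `Float64`, whole numbers are `Int64`, text is `String`, and Boolean values are `Bool`. The sketch below uses a hypothetical `products` table (its column names are assumptions for illustration) and checks each column's element type with `eltype`.

```julia
using DataFrames

# Hypothetical table with one column per data type described above.
products = DataFrame(
    Price   = [19.79, 4.1119],                               # Numeric   -> Float64
    Units   = [-100, 2],                                     # Integer   -> Int (Int64 on 64-bit systems)
    Label   = ["Remyx Courses", "artificial intelligence"],  # Character -> String
    InStock = [true, false],                                 # Logical   -> Bool
)

# eltype reports the type of the values stored in a column.
for name in names(products)
    println(name, " => ", eltype(products[!, name]))
end
```

Every value in a column shares that column's type, which is one way a Data Frame differs from a free-form spreadsheet.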



Reading and Writing CSV Files As Data Frames

Before you can even begin the Data Preparation phase of CRISP-DM, you need to understand what data you have and what data you don’t have but need. This is called the Data Understanding phase of the CRISP-DM process model.

One common way of getting the data you need is to read in data stored in tabular format in a CSV (comma-separated values) file. A CSV file is a text file that uses commas to separate values. Each line (or row) of the file is a data record, and each record consists of one or more columns separated by commas. A CSV file has the .csv filename extension and looks like a spreadsheet when you open it in a spreadsheet program.

The dataset we will be using for this course comes from the University of California Irvine (UCI) Machine Learning Repository, where it is listed as the Online Retail dataset. The data contains transactions occurring between December 1, 2010 and December 9, 2011 (written as 01/12/2010 and 09/12/2011 in UK day/month/year format) for a UK-based and registered non-store online retail company. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

We’ve uploaded the dataset from the UCI Machine Learning Repository to Dropbox specifically for this course so it can be read directly into R as a CSV. This saves you from manually downloading the Excel file to your local directory, saving it as a .csv file, and then writing R code to read from that specific directory.

We like this dataset because it looks similar to datasets you’d encounter in real-world business settings. Typically, you’d be able to access transaction datasets like this one from your company’s data warehouse.



Reading In/Importing CSV Files

To read in the Online Retail CSV dataset as a Data Frame in R, you’d use the fread() function from the data.table package in R. The name fread stands for “fast read”, and it’s Remix Institute’s only recommended way of reading in CSV files. The code for doing that is below:

R Code for Reading In CSVs

# Read In/Import into R the Online Retail CSV dataset from Remix Institute's Dropbox

online_retail_data = data.table::fread("https://www.dropbox.com/s/ygecmz70oy5ch9i/Online%20Retail.csv?dl=1", header = TRUE, stringsAsFactors = FALSE)

The string inside the quotation marks is the location of the CSV file you want to read in. The header = TRUE argument means that the first row of the CSV file contains the column (header) names.
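For readers following along in Julia, a comparable read with CSV.jl might look like the sketch below. The helper name `read_csv_from_url` is hypothetical, and it assumes the file can be fetched with Julia's standard `Downloads` library. To keep the example runnable offline, the demonstration parses a small in-memory CSV whose rows are made-up sample values rather than the real dataset.

```julia
using CSV, DataFrames, Downloads

# Hypothetical helper: download a CSV from a URL to a temporary file, then parse it.
# header = 1 tells CSV.jl that the first row holds the column names (this is also the default).
function read_csv_from_url(url::AbstractString)
    local_path = Downloads.download(url)
    return CSV.read(local_path, DataFrame; header = 1)
end

# Usage with the course's Dropbox link (requires network access and assumes the link is still live):
# online_retail_data = read_csv_from_url("https://www.dropbox.com/s/ygecmz70oy5ch9i/Online%20Retail.csv?dl=1")

# Offline demonstration: parse CSV text held in memory instead of a file.
sample_csv = """
InvoiceNo,StockCode,Quantity,UnitPrice,Country
536365,85123A,6,2.55,United Kingdom
C536379,D,-1,27.5,United Kingdom
"""
sample = CSV.read(IOBuffer(sample_csv), DataFrame; header = 1)

println(names(sample))
println(sample)
```

`CSV.read` infers a type for each column from the values it finds, so a column that mixes digits and letters, such as InvoiceNo here, is read as text.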

Writing/Exporting CSV Files

To write out a Data Frame in R as a CSV file to a local directory, you’d use the fwrite() function from the data.table package in R. The name fwrite stands for “fast write”, and it’s Remix Institute’s only recommended way of writing out CSV files. The code for doing that is below:

R Code for Writing to CSV

# Write Out/Export an R Data Frame as CSV to your local directory

# 1. Replace online_retail_data with the name of the dataframe you want to export
# 2. Replace the string inside the quotations to the location of the directory and filename you want to export to

data.table::fwrite(online_retail_data, "C:/Users/RemixStudent/Documents/online_retail_data.csv")
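A Julia version of the export step could use `CSV.write`, as in this minimal sketch. The two-row `online_retail_data` table is a hypothetical stand-in for the real dataset, and the file is written to a temporary directory so the example does not depend on any particular folder existing on your machine.

```julia
using CSV, DataFrames

# Hypothetical stand-in for the Data Frame you want to export.
online_retail_data = DataFrame(
    InvoiceNo = ["536365", "536366"],
    Quantity  = [6, 2],
)

# Build an output path inside a fresh temporary directory.
# Replace this with the directory and filename you want to export to.
output_path = joinpath(mktempdir(), "online_retail_data.csv")

# Write the Data Frame to disk; the column names become the header row.
CSV.write(output_path, online_retail_data)

# Read the file back to confirm the export worked.
# Column types are inferred again on read, so all-digit text comes back as integers.
round_trip = CSV.read(output_path, DataFrame)
println(round_trip)
```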


Understanding The Columns And Data Types Of Your Data Frame

According to University of California Irvine Machine Learning Repository’s website, the column and data type information for the dataset is as follows:

Column | Description | Data Type
InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation. | 
StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. | 
Description | Product (item) name. Nominal. | 
Quantity | The quantities of each product (item) per transaction. Numeric. | Integer
InvoiceDate | Invoice Date and time. Numeric, the day and time when each transaction was generated | Integer
UnitPrice | Unit price. Numeric, Product price per unit in sterling. | Integer
CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. | Integer
Country | Country name. Nominal, the name of the country where each customer resides | Character



Creating New Columns In A Data Frame

You’ll notice, based on UCI’s documentation, that any invoice numbers that start with ‘c’ are cancellations. We’re going to create a new column in the Data Frame called CancelledInvoiceFlag which indicates if the invoice was cancelled. In the R code below, we show two ways for creating new columns in R (see R code comments).

R Code for Creating a New Column called CancelledInvoiceFlag

We’ll also create another column called NegativeQuantityFlag, which indicates whether the value in the Quantity column is negative.
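In Julia, the two flag columns described above could be created as in the sketch below, which shows two common DataFrames.jl approaches: direct column assignment and `transform!`. The `invoices` table is hypothetical sample data, and the check lowercases InvoiceNo first on the assumption that the cancellation prefix may appear as either ‘c’ or ‘C’.

```julia
using DataFrames

# Hypothetical sample data: one cancelled invoice and two negative quantities.
invoices = DataFrame(
    InvoiceNo = ["536365", "C536379", "536400"],
    Quantity  = [6, -1, -5],
)

# Way 1: assign a new column directly.
# The dot after a function name applies it to every element of the column.
invoices.CancelledInvoiceFlag = startswith.(lowercase.(invoices.InvoiceNo), "c")

# Way 2: use transform! with the pattern source column => function => new column.
# ByRow wraps the function so it runs once per row.
transform!(invoices, :Quantity => ByRow(q -> q < 0) => :NegativeQuantityFlag)

println(invoices)
```

Both approaches modify `invoices` in place; `transform` without the exclamation mark would return a new Data Frame instead.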

Filtering A Data Frame

Filtering a data frame is the process of taking a subset or smaller part of the full dataset based on certain conditions that you specify. The conditions are applied to the columns of the data frame.

In real-world business scenarios, it is often important to remove cancellations and refunds from your analysis and model. You don’t want to attribute sales to a transaction if the customer cancelled the order or returned the product for a refund. In this course, we will be removing cancellations from the dataset.

Also, you’ll notice that there are some non-cancelled invoices where Quantity is a negative number. Many times, these non-cancelled invoices with a negative Quantity number have a StockCode but no Description or CustomerID. These look like bad data points and should be removed from the dataset before analysis or modeling.
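The two filtering rules above (drop cancellations, then drop rows with a negative Quantity) could be sketched in Julia with the `filter` function from DataFrames.jl. The `transactions` table and the helper `is_cancelled` are hypothetical, introduced only for this illustration.

```julia
using DataFrames

# Hypothetical sample data covering each case described above.
transactions = DataFrame(
    InvoiceNo = ["536365", "C536379", "536400", "536401"],
    Quantity  = [6, -1, -5, 3],
)

# Hypothetical helper: an invoice is a cancellation if its number starts with 'c' or 'C'.
is_cancelled(invoice_no) = startswith(lowercase(invoice_no), "c")

# Step 1: keep only the rows that are not cancellations.
not_cancelled = filter(:InvoiceNo => !is_cancelled, transactions)

# Step 2: among those, keep only the rows where Quantity is not negative.
cleaned = filter(:Quantity => (q -> q >= 0), not_cancelled)

println(cleaned)
```

`filter` returns a new Data Frame and leaves `transactions` unchanged, so the original rows remain available if you need to audit what was removed.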

Selecting Columns In A Data Frame

Grouping in R

Summarizing A Data Frame in R

Merging And Joining Data Frames in R

Citation