CRISP-DM Data Science Process Model

2 minute read

What is the Analytics and Machine Learning Development Lifecycle?

Companies are hiring data scientists and data analysts like crazy. Many of them are just now jumping on the bandwagon and building data science, analytics, and machine learning teams from scratch.

The Analytics and Machine Learning Development Lifecycle is the development process a data scientist or data analyst follows to build a machine learning model or to perform analytics for the business.

Given that it could take years before data science and machine learning become part of business operations, it is important to implement a standardized development process for the Analytics and Machine Learning Development Lifecycle early. The benefits of a standardized process are:

  • Assurance for stakeholders that best practices are being followed
  • Establishment of governance for model building
  • Maintenance of quality standards
  • Re-usable code and reproducible modeling results and analysis findings
  • Reduced time and cost of future model development and ad-hoc analytics
  • An easily auditable process
  • Transparency and ease of version control
  • Organized, confusion-free work

What is CRISP-DM?

The Analytics and Modeling Development Lifecycle is the standard workflow of a typical analytics or statistical/machine learning model project. It is common to use the CRISP-DM process model, which was developed by companies such as Teradata, Mercedes-Benz Group (Daimler AG), and NCR Corporation and is used by companies like IBM. CRISP-DM breaks the process of data mining, analytics, and modeling into 6 phases:

  • Business Understanding – This phase involves gathering the business requirements from relevant stakeholders, such as the problems they’re looking to solve and the pain points they have.
  • Data Understanding – This phase involves taking inventory of the data the business has available that’s needed to solve the problem identified during the Business Understanding phase. It also involves identifying any data you’d need to acquire or construct in order to provide a solution.
  • Data Preparation – This phase involves all the data wrangling, data cleaning, data QA, imputations, handling of missing data, joins, derived columns, and many other steps. It’s been said that 80% of a data scientist’s or data analyst’s time is spent on the Data Preparation phase.
  • Modeling – This phase involves building a statistical or machine learning model to solve the business problem identified in the Business Understanding phase. This phase can also mean producing analytical insights from the data (not necessarily via statistical modeling) in order to solve the business problem or provide a recommendation based on those insights.
  • Evaluation – This phase involves determining whether your statistical or machine learning model produces accurate predictions on out-of-sample data and has real-world application. This phase can also mean ensuring your analytical insights from the data (not necessarily derived via statistical modeling) are accurate and reliable.
  • Deployment – This phase involves operationalizing your statistical/machine learning model into a production environment for consumption by the business, or it may involve deploying your analytical insights as strategic recommendation inputs for future business decision-making.
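To make the middle phases concrete, here is a minimal, self-contained sketch of the Data Preparation, Modeling, and Evaluation steps in Python. The dataset, the median-imputation choice, and the simple least-squares model are all hypothetical illustrations, not part of CRISP-DM itself:

```python
import random
import statistics

# Hypothetical toy dataset: y is roughly 2x + 1 with noise.
random.seed(0)
raw = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in range(50)]

# Data Preparation: simulate missing feature values, then impute with the median.
features = [x if x % 7 else None for x, _ in raw]  # every 7th value "missing"
targets = [y for _, y in raw]
median_x = statistics.median(v for v in features if v is not None)
features = [median_x if v is None else v for v in features]

# Modeling: simple linear regression via closed-form least squares on a train split.
train_x, test_x = features[:40], features[40:]
train_y, test_y = targets[:40], targets[40:]
mean_x, mean_y = statistics.mean(train_x), statistics.mean(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

# Evaluation: mean absolute error on the held-out (out-of-sample) split.
mae = statistics.mean(abs((slope * x + intercept) - y)
                      for x, y in zip(test_x, test_y))
print(f"slope={slope:.2f}, intercept={intercept:.2f}, MAE={mae:.2f}")
```

In a real project each step would be far more involved (richer cleaning, a proper model library, cross-validation), but the shape is the same: prepare the data, fit on a training split, and judge the model on data it has never seen.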

What about Six Sigma’s DMAIC model?

Six Sigma professionals use a similar method called DMAIC. This stands for Define, Measure, Analyze, Improve, and Control. While there are some similarities between DMAIC and CRISP-DM, DMAIC is geared toward operations professionals and statistical process control. CRISP-DM is specifically for data science and machine learning development lifecycles in almost any vertical.