Dealing with data preparation: best practices - Part 2
If you haven't read part 1 of this article, published yesterday, be sure to check it out before reading this one.
Getting started with data preparation: best practices
The challenge is getting good at data preparation (DP). As a recent report by business intelligence pioneer Howard Dresner found, 64% of respondents constantly or frequently perform end-user DP, but only 12% reported they were very effective. Nearly 40% of data professionals spend half of their time prepping data rather than analyzing it.
The following are a few of the practices that help ensure effective DP for your AI and BI projects. Many more are available from data preparation service and product suppliers.
Best practice 1: Decide which data sources are needed to meet AI and BI requirements
Take these three general steps to data discovery:
- Identify the data needed to meet required business tasks.
- Identify potential internal and external sources of that data, along with the owners of each source.
- Ensure that each source will be available at the required frequency.
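The three discovery steps above can be captured in a simple source inventory. The following is a minimal sketch, not a prescribed tool: the DataSource fields, the example source names and owners, and the idea of expressing frequency as hours between refreshes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """One candidate source in a hypothetical discovery inventory."""
    name: str
    owner: str           # who to contact about the data
    internal: bool       # internal system or external provider
    refresh_hours: int   # hours between deliveries from this source


def sources_too_slow(sources, required_hours):
    """Return names of sources that refresh less often than the project requires."""
    return [s.name for s in sources if s.refresh_hours > required_hours]


# Illustrative entries only; real names and owners come from your own discovery work.
catalog = [
    DataSource("crm_accounts", "Sales Ops", True, 24),
    DataSource("web_clickstream", "Marketing", True, 1),
    DataSource("market_benchmarks", "External vendor", False, 720),
]

# A daily report needs every source refreshed at least once every 24 hours.
print(sources_too_slow(catalog, required_hours=24))
```

Any source flagged here needs a conversation with its owner before the project commits to it.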
Best practice 2: Identify tools for data analysis and preparation
It will be necessary to load data sources into DP tools so the data can be analyzed and manipulated. It’s important to get the data into an environment where it can be closely examined and readied for the next steps.
Best practice 3: Profile potential and selected source data
This is a vital (but often discounted) step in DP. A project must analyze source data before it can be properly prepared for downstream consumption. Beyond simple visual examination, you need to profile data, detect outliers, and find null values (and other unwanted data) among sources.
The primary purpose of this profiling analysis is to decide which data sources are even worth including in your project. As data warehouse guru Ralph Kimball writes in his book, The Data Warehouse Toolkit, 'Early disqualification of a data source is a responsible step that can earn you respect from the rest of the team'.
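As a rough illustration of what profiling a single column can involve, the sketch below counts null values and flags outliers. The 1.5 x IQR rule used here is one common convention, chosen as an assumption for this example, and the function name and sample values are hypothetical.

```python
import statistics


def profile_column(values):
    """Summarize one column: row count, nulls, and outliers by the 1.5 * IQR rule."""
    present = [v for v in values if v is not None]
    nulls = len(values) - len(present)
    # Quartiles of the non-null values; the middle cut point is not needed here.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {
        "rows": len(values),
        "nulls": nulls,
        "null_rate": nulls / len(values),
        "outliers": outliers,
    }


# Made-up order amounts with two missing values and one suspicious entry.
order_amounts = [20, 22, 19, None, 21, 23, 20, 950, None, 18]
report = profile_column(order_amounts)
print(report)
```

A report like this, produced per column and per source, gives the team concrete numbers to weigh when deciding whether a source should be kept or disqualified.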
Best practice 4: Cleanse and screen source data
Based on your knowledge of the end business analytics goal, experiment with different data cleansing strategies that will get the relevant data into a usable format. Start with a small, statistically valid sample to iteratively experiment with different data prep strategies, refine your record filters, and discuss the results with business stakeholders.
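One way to run such an experiment is to draw a random sample and measure how many records each candidate cleansing strategy keeps. The sketch below assumes a simple random sample is adequate, which may not hold for skewed data; the two strategies, the field names, and the sample size of 60 are invented for illustration.

```python
import random


def clean_strict(record):
    """Strategy A: reject any record whose amount is missing or not a plain number."""
    try:
        return {**record, "amount": float(record["amount"])}
    except (TypeError, ValueError):
        return None


def clean_lenient(record):
    """Strategy B: strip currency symbols and commas, and treat blanks as 0.0."""
    raw = (record["amount"] or "").replace("$", "").replace(",", "").strip()
    try:
        return {**record, "amount": float(raw) if raw else 0.0}
    except ValueError:
        return None


def kept_fraction(sample, strategy):
    """Share of sampled records that survive a given cleansing strategy."""
    cleaned = [strategy(r) for r in sample]
    return sum(c is not None for c in cleaned) / len(sample)


# Synthetic records that mimic typical formatting problems in an amount field.
raw_amounts = ["10.5", "$1,200", "", "abc", None, "42"] * 50
records = [{"id": i, "amount": a} for i, a in enumerate(raw_amounts)]

random.seed(7)  # fixed seed so the experiment can be repeated
sample = random.sample(records, 60)

for strategy in (clean_strict, clean_lenient):
    print(strategy.__name__, round(kept_fraction(sample, strategy), 2))
```

The kept fraction is only a starting point for the discussion with stakeholders: a lenient rule keeps more rows, but defaulting blanks to zero may distort the analysis.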
Once you have found what seems to be a good DP approach, take time to rethink the subset of data you really need to meet the business objective. Running your data prep rules on the entire data set will be very time-consuming, so think critically with business stakeholders about which entities and attributes you do and don’t need and which records you can safely filter out.
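Narrowing the data before running the full set of prep rules can be as simple as the sketch below. The field names and the rule that cancelled orders can be dropped are hypothetical; the real choices should come out of the discussion with business stakeholders.

```python
def narrow(records, keep_fields, keep_record):
    """Yield only the needed attributes from the records that pass the filter."""
    for record in records:
        if keep_record(record):
            yield {field: record[field] for field in keep_fields}


# Illustrative order records with an attribute the analysis does not need.
orders = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "status": "shipped", "internal_note": "x"},
    {"order_id": 2, "region": "US", "amount": 80.0, "status": "cancelled", "internal_note": "y"},
    {"order_id": 3, "region": "EU", "amount": 45.0, "status": "shipped", "internal_note": "z"},
]

needed = list(
    narrow(
        orders,
        keep_fields=["order_id", "region", "amount"],
        keep_record=lambda r: r["status"] != "cancelled",
    )
)
print(needed)
```

Because narrow is a generator, it processes one record at a time, so the same filter can be applied to a large data set without loading a second full copy into memory.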
Final thoughts
Proper and thorough data preparation, conducted from the start of an AI/BI project, leads to faster, more efficient AI and BI down the line. The DP steps and processes outlined here apply to whatever technical setup you are using, and they will get you better results.
Note that DP is not a 'do once and forget' task. Data is constantly generated from multiple sources that may change over time, and the context of your business decisions will certainly change over time. Partnering with data preparation solution providers is an important consideration for the long-term capability of your DP infrastructure.
Author: Wayne Yaddow
Source: TDWI