Dealing with data preparation: best practices - Part 1
IBM is reporting that data quality challenges are a top reason why organizations are reassessing (or ending) artificial intelligence (AI) and business intelligence (BI) projects.
Arvind Krishna, IBM’s senior vice president of cloud and cognitive software, stated in a recent interview with the Wall Street Journal that 'about 80% of the work with an AI project is collecting and preparing data. Some companies aren’t prepared for the cost and work associated with that going in. And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it'.
Many businesses are not prepared for the cost and effort associated with data preparation (DP) when starting AI and BI projects. To compound matters, hundreds of data and record types and billions of records are often involved in a project’s DP effort.
However, data analytics projects are increasingly imperative to organizational success in the digital economy, hence the need for DP solutions.
What is AI/BI data preparation?
Gartner defines data preparation as 'an iterative and agile process for exploring, combining, cleaning, and transforming raw data into curated datasets for data integration, data science, data discovery, and analytics/business intelligence (BI) use cases'.
A 2019 International Data Corporation (IDC) study reports that data workers spend a remarkable share of their time each week on data-related activities: 33% on data preparation compared to 32% on analytics (and, sadly, just 13% on data science). The top challenge cited by more than 30% of all data workers in this study was that 'too much time is spent on data preparation'.
The variety of data sources, the multiplicity of data types, the enormity of data volumes, and the numerous uses for data analytics and business intelligence all add complexity to each project. Consequently, today’s data workers often rely on numerous tools to complete DP successfully.
Capabilities needed in data preparation tools
According to the Gartner Research report Market Guide for Data Preparation Tools, implementing DP tools can cut the time spent on data preparation, and on reporting the information discovered during DP, by more than half.
In the same research report, Gartner lists details of vendors and DP tools. The analyst firm predicts that the market for DP solutions will reach $1 billion this year, with nearly a third (30%) of IT organizations employing some type of self-service data preparation tool set.
Another Gartner Research Circle Survey on data and analytics trends revealed that over half (54%) of respondents want and need to automate their data preparation and cleansing tasks during the next 12 to 24 months.
To accelerate data understanding and improve trust, data preparation tools should have certain key capabilities, including the ability to:
- Extract and profile data. Typically, a data prep tool uses a visual environment that enables users to interactively extract, search, sample, and prepare data assets.
- Create and manage data catalogs and metadata. Tools should be able to create and search metadata as well as track data sources, data transformations, and user activity against each data source. They should also keep track of data source attributes, data lineage, relationships, and APIs. All of this enables access to a metadata catalog for data auditing, analytics/BI, data science, and other operational use cases.
- Support basic data quality and governance features. Tools must be able to integrate with other tools that support data governance/stewardship and data quality criteria.
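To make the first capability concrete, here is a minimal sketch of what "profiling" a data set can mean in practice. It is an illustration only, not how any particular DP tool works: the function name profile_records, the sample records, and the choice to treat None and empty strings as missing are all assumptions made for this example.

```python
from collections import Counter


def profile_records(records):
    """Summarize each column: row count, missing values, distinct values, observed types.

    Hypothetical helper for illustration; None and "" are treated as missing.
    """
    # Collect every column name that appears in any record.
    columns = sorted({key for row in records for key in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in records]
        present = [v for v in values if v not in (None, "")]
        profile[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return profile


# Made-up sample data containing typical quality problems:
# a blank country, a missing revenue, a text value in a numeric column,
# and a repeated customer_id.
sample = [
    {"customer_id": 1, "country": "US", "revenue": 120.5},
    {"customer_id": 2, "country": "", "revenue": None},
    {"customer_id": 3, "country": "DE", "revenue": "n/a"},
    {"customer_id": 3, "country": "DE", "revenue": 80.0},
]

if __name__ == "__main__":
    for column, stats in profile_records(sample).items():
        print(column, stats)
```

Even a summary this small surfaces the kinds of issues a DP tool is expected to flag: missing values, mixed types within one column, and fewer distinct identifiers than rows.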
Keep an eye out for part 2 of this article, where we take a deeper dive into best practices for data preparation.
Author: Wayne Yaddow