Is AI a threat or an opportunity to data engineers?
Humans losing jobs to robots has been the preoccupation of economists and sci-fi writers alike for almost 100 years. AI systems are the next perceived threat to human jobs, but which jobs? Sourcing the logic from numerous open source packages or paid API services, connecting disparate datasets, and maintaining a pipeline are complex tasks that AIs are ill-suited to do at present.
AI and the data pipeline
A well set up data pipeline is a thing of beauty, seamlessly connecting multiple datasets to a business intelligence tool to allow clients, internal teams, and other stakeholders to perform complex analysis and get the most out of their data.
Data engineers thrive on interesting challenges: bringing terabytes of data from wherever it lives to where it can be analyzed, transforming it using various libraries and services, and keeping the pipeline stable. However, the data preparation phase poses its own issues. It can be a creative process, and it’s certainly necessary, but saving the preparation logic and automatically rerunning it every few hours is a challenge. Increasingly, teams address this challenge by bringing in artificial intelligence and machine learning.
Augmented analytics is the next iteration of business intelligence, where AI elements are incorporated into every phase of the BI process. The powerful AI analytics systems emerging today assist users in a broad range of ways, but we’ll stay focused on data prep for this article.
We’ll discuss three parts of the data preparation process where AI can help: data cleaning and transformation, extracting and loading, and verifying the prepared data.
Clean as you go
The saying 'data is the new oil' gets tossed around enough to have already become a cliche, but for purposes of our discussion it’s an especially apt metaphor. Most companies are sitting on huge stores of data, but in its unprocessed form, it’s not very useful. Even worse, analyzing non-normalized data can produce misleading and potentially harmful results. To continue with the oil metaphor, you need a stable and reliable pipeline to take your data from where it’s stored to where it’ll be processed so that its true value can be harnessed.
While that data is in motion, data engineers can process it so that it’s closer to a usable state by the time it hits the BI system. BI platforms are already using AI to help with the data cleansing process in a variety of ways. Let’s walk through how AI can assist you:
- AI assistance can recommend a data model structure, including which columns to join and which to compound, and may even create dimension tables to facilitate the fact table joins.
- AI systems can apply simple rulesets to help standardize the data by doing things like making all text lowercase and removing blank spaces before and after values.
- If you already have a perfectly formatted dataset to use as a learning dataset, AI assistance can even be trained on this to recognize how the larger dataset should look, allowing it to take a holistic approach to cleansing, rather than you telling it specific tasks to do.
- As AI assistance learns how you want your data to look, the system can even scan all the columns and recommend what to fix, implement active learning, or go ahead and fix errors on its own, such as removing redundant records (duplicates caused by misspellings, for example) or using context clues to fill in missing values.
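To make the simpler cleansing steps in the list above concrete, here is a minimal sketch in plain Python. It is not how any particular BI platform works: the function names, the sample records, and the choice to fill blanks with the most common value (a crude stand-in for "context clues") are all assumptions made for illustration. It also only catches duplicates that match exactly after standardization; duplicates caused by misspellings would need fuzzy matching.

```python
from collections import Counter

def standardize(value):
    """Lowercase text and strip surrounding whitespace; leave other values alone."""
    if isinstance(value, str):
        return value.strip().lower()
    return value

def clean_records(records, key_fields, fill_field):
    """Standardize text, drop duplicate rows, and fill blanks in one field.

    key_fields and fill_field are hypothetical parameters for this sketch.
    """
    cleaned, seen = [], set()
    for record in records:
        row = {name: standardize(value) for name, value in record.items()}
        key = tuple(row[name] for name in key_fields)
        if key in seen:
            continue  # same key already kept, so treat this row as redundant
        seen.add(key)
        cleaned.append(row)

    # Assumption: fill missing values with the most common known value.
    known = [row[fill_field] for row in cleaned if row[fill_field] not in (None, "")]
    if known:
        most_common = Counter(known).most_common(1)[0][0]
        for row in cleaned:
            if row[fill_field] in (None, ""):
                row[fill_field] = most_common
    return cleaned

# Made-up sample data for illustration only.
rows = [
    {"customer": "  Acme Corp ", "country": "US"},
    {"customer": "acme corp", "country": "us"},
    {"customer": "Globex", "country": None},
    {"customer": "Initech", "country": "US "},
]
cleaned_rows = clean_records(rows, key_fields=["customer"], fill_field="country")
```

In this example the two Acme rows collapse into one once case and whitespace are normalized, and the missing country for Globex is filled from the other rows.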
Extracting and loading
The rise of cloud data warehouses has changed the way companies treat their data. In the past, well-organized databases were needed to keep records in order. Today, data comes from a wide array of different sources and in a variety of different forms, from user-generated to sensor data. More and more frequently, we even see companies using third-party data to enrich their business logic (for example, how will the weather forecast affect my sales?).
This change coincided with an increase in the sophistication of AI data analytics systems, allowing them to deal with data of all types, structured (numerical) and unstructured (text, image, video). Storage in cloud warehouses like Redshift is cheap, and data gathering and storage are often handled by different roles. So rather than worry about how everything is formatted, companies just pump everything into the warehouse as-is and deal with it later.
This is another place where BI with AI has a chance to shine, extracting the data, performing transformations on it, then loading it into the BI tool. The same AI abilities mentioned before can be applied in this way to end up with usable data at the endpoint: removing duplicate records, filling blank values, and suggesting other cleansing and transformation actions, such as clustering and segmentation, based on the learning dataset. However your data is stored, the right AI analytics tool can help get it into better shape for when you create your single source of truth; it can also help as you load your data into your BI platform or data science tool.
While you’re moving your data into your BI system, the big chance for an AI assist is in monitoring the process. If a load fails or takes longer than its normal or forecasted duration, the AI can detect that and ping the engineer to let them know there’s a problem. A sudden change in the volume of data being loaded could also be worth a mention, so that the engineer can look into it and see if there’s a larger problem.
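The monitoring idea above can be sketched with a simple statistical check. This is a deliberately basic stand-in for whatever a real AI analytics system would learn: it compares the newest load with past runs using a z-score, and the function name, the record fields, and the threshold of 3 standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev

def check_load(history, latest, z_threshold=3.0):
    """Return alert messages for the newest load run.

    history: at least two earlier successful runs, each a dict with
             "seconds" and "rows" (hypothetical field names).
    latest:  dict with "status", "seconds", and "rows" for the newest run.
    """
    if latest["status"] != "success":
        return ["load failed"]

    alerts = []
    for metric in ("seconds", "rows"):
        values = [run[metric] for run in history]
        avg, spread = mean(values), stdev(values)
        if spread == 0:
            continue  # no variation in history, so a z-score is undefined
        z = (latest[metric] - avg) / spread
        if abs(z) > z_threshold:
            alerts.append(f"{metric} is unusual: {latest[metric]} vs. typical {avg:.0f}")
    return alerts

# Made-up load history for illustration only.
history = [
    {"seconds": 118, "rows": 10_000},
    {"seconds": 122, "rows": 10_200},
    {"seconds": 120, "rows": 9_900},
    {"seconds": 121, "rows": 10_100},
]
alerts = check_load(history, {"status": "success", "seconds": 125, "rows": 4_000})
```

Here the load time is within its usual range, but the row count has dropped sharply, so the only alert raised concerns the data volume. In practice the alert would be routed to the engineer rather than just returned.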
The bottom line is that a strong AI analytics system can be a second set of eyes for a busy data engineering team, freeing them to focus on the challenges that drive more value to the analytics team, and ultimately the business.
Outliers, efficiency, and verifying results
Outlier detection is one task that an AI system can be designed to handle that would have huge benefits for data engineers dealing with large volumes of not-quite-perfect data. The AI would monitor tables as they get created and new data gets loaded, and check the outputs. As the system scans the values within a column, it could test for things like uniqueness, referential integrity (to values that are keys in other tables), skewed distribution, null values, and accepted values. It would basically be checking the whole table and asking 'does this column look correct?' based on a series of rules that could be applied to it. If the AI believes that one of the rules could apply, and that the column’s values do not meet the rule’s conditions, then it would send an alert to the engineers.
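As a rough illustration of these column rules, the sketch below checks a single column for uniqueness, null values, accepted values, and referential integrity. The rules here are switched on by hand, whereas the system described above would infer which rules apply. The function name and its parameters are hypothetical, and the skewed-distribution test is left out.

```python
from collections import Counter

def check_column(values, unique=False, not_null=False, accepted=None, references=None):
    """Return a list of rule violations for one column's values."""
    problems = []
    present = [v for v in values if v is not None]

    if not_null and len(present) < len(values):
        problems.append(f"{len(values) - len(present)} null value(s)")

    if unique:
        repeated = sorted(v for v, count in Counter(present).items() if count > 1)
        if repeated:
            problems.append(f"duplicate values: {repeated}")

    if accepted is not None:
        unexpected = sorted(set(present) - set(accepted))
        if unexpected:
            problems.append(f"unexpected values: {unexpected}")

    if references is not None:
        # Referential integrity: every value must exist as a key elsewhere.
        orphans = sorted(set(present) - set(references))
        if orphans:
            problems.append(f"values missing from referenced key: {orphans}")

    return problems

# Made-up columns for illustration only.
id_problems = check_column(["A1", "A2", "A2", None], unique=True, not_null=True)
fk_problems = check_column(["A1", "A3"], references={"A1", "A2"})
```

Any non-empty result is what would trigger an alert to the engineers.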
Trusting your data without checking your work is a recipe for disaster. Having a few questions you already know ballpark answers to can be a great way to test your AI-prepped data in the aftermath. If your answers come back within acceptable limits, then you know the prep process was (acceptably) successful. If there are major discrepancies, you may have to retrain the system or adjust the strictness/laxness of the settings you’re using.
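A ballpark check of this kind can be sketched in a few lines. The metric names, the figures, and the 5% tolerance are invented for illustration; the point is only that known answers are compared with what the prepared data produces.

```python
def verify_against_known(results, expected, tolerance=0.05):
    """Return metrics whose computed value is missing or outside the tolerance.

    results:   metrics computed from the prepared data.
    expected:  ballpark figures you already trust.
    tolerance: allowed relative difference (0.05 means 5%).
    """
    discrepancies = {}
    for name, target in expected.items():
        actual = results.get(name)
        if actual is None or abs(actual - target) > tolerance * abs(target):
            discrepancies[name] = (actual, target)
    return discrepancies

# Hypothetical figures for illustration only.
expected = {"monthly_revenue": 1_000_000, "active_accounts": 5_000}
results = {"monthly_revenue": 1_020_000, "active_accounts": 6_200}
discrepancies = verify_against_known(results, expected)
```

In this example revenue is within 5% of the expected figure, while the account count is not, which would be a cue to revisit the prep settings. Tightening or loosening `tolerance` corresponds to adjusting how strict the check is.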
Some other tasks a BI system with AI can assist with include showing you which joins are occurring most frequently across your model and suggesting pre-aggregation. This could prove useful for data analysts to know and help them with speedier queries down the road. AI could also scan columns and test for uniqueness. For example, if every value needs to be unique, like an ID column for all your Salesforce accounts, and there are two different users with the same account ID, then the AI could call that out. For purely numerical data, AI could identify outliers that might indicate improperly entered data. Either way, the AI is once again an extra set of eyes, performing detailed, routine work, at scale, and surfacing the results to human data engineers only when necessary.
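For the numerical case, one common and simple way to flag suspicious values is the interquartile range (IQR) rule, sketched below. This is a generic statistical technique, not a description of any vendor's method. The function name, the sample values, and the conventional multiplier of 1.5 are assumptions.

```python
from statistics import quantiles

def find_outliers(values, multiplier=1.5):
    """Return values lying more than multiplier * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartiles
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

# Made-up order amounts; 1000 looks like a data-entry error.
order_amounts = [102, 98, 101, 97, 100, 99, 103, 1000]
outliers = find_outliers(order_amounts)
```

Only the flagged values would need to be surfaced to a human engineer for review.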
Is AI taking engineering jobs?
Although humans losing jobs to robots makes for a good story, it is far from the reality for data engineers. Routine tasks like eliminating redundant data, filling in gaps in datasets, and pinging human engineers when anomalies arise are all places where AI analytics systems can really add value. They do the heavy lifting that humans don’t really want to do anyway and free hard-working data engineers to tackle the challenging problems that will lead to bigger rewards for the company down the line.
Author: Inna Tokarev Sela
Source: Sisense