data science pipeline

Zooming In On The Data Science Pipeline

Finding the right data science tools is paramount if your team is to discover business insight. Here are five things to look for when you search for your next data science platform.

If you are old enough to have grown up with the Looney Tunes cartoons, you probably remember watching clips of Wile E. Coyote chasing the Road Runner hoping to one day catch him. In each episode, the coyote would use increasingly outrageous tools to try to outwit his nemesis, only to fail disastrously each time. Without the right tools, he was forever doomed to failure.

As a data scientist, do you constantly feel like you are bringing the wrong tool to the job as you strive to find and capture one of the most valuable, yet elusive, targets around -- business insight?

As data science tools and platforms mature, organizations are constantly looking to find what their analysts need to be most effective in their jobs. The right tool could mean the difference between success and failure when put in the hands of capable data scientists.

As you are trying to find the right data science tools for your team, here are five areas to consider in your evaluation.


The first thing you need to evaluate when looking at a potential data science platform is what algorithms it supports. In your assessment of algorithms, you must understand just what your business is and what your data science organization will actually use.

There are many algorithms available. Some are generic in nature and can be used in a broad set of scenarios. Others are very specific to unique problem sets. In the hands of the right data scientist, both types of algorithms can be extremely advantageous and valuable. The challenge is that the more algorithms available, the harder it is for the team to select the correct one to meet the current business problem. In your evaluation, ensure that the algorithms known to your team are available and are not crowded out by algorithms they will not use.

In addition to the algorithms that are already pre-packaged as part of the data science platform, one area to look at is the extensibility of the platform. Can new algorithms be added? Are there marketplaces of new algorithms available for the platform? Can the team evolve the algorithms to meet their needs? Such extensibility will provide your team access to new and valuable algorithms as they become available and can become a critical success factor for your data science team.

Data Preprocessing

One of the main tasks your team will be performing is preparing the data. This involves cleaning the data, transforming it, breaking the conglomerate data into its parts, and normalizing it. Different types of algorithms have limitations on what data they can consume and use. Your data science platform must be able to take available data and prepare it for input into your process.

If you have text data in your environment, text processing can be a vital component to your data science platform. This can be as simple as parsing the text into individual words or it can involve more complex data, such as the meaning of these words, the topics associated with the text, or the sentiment of the text. If this is important to your data science program, make sure your platform has the right support for your use cases.

Model Training and Testing

Once you have the right data in the right format and you have chosen the right algorithm or set of algorithms, the next step is to use these to define a model. When evaluating data science tools, understand what this process of model training and testing looks like and how it functions.

In your evaluation, understand if this process is accomplished through a graphical user interface or through coding. With the training process, understand what parameters are available to measure the progress on the model creation and how to define stopping points. As an automated iterative process, you will want your team to define when that process is completed and when the results are good enough to move to the next step.

Look at the documentation output of the model development process. Does it give you enough traceability about what the resulting model is, how it works, and why it chose that model over other variations? These can be critical in selling your results to the business and are becoming a requirement from governments if the model has an impact on decisions where bias could be detrimental to people.


You might have a small team of data scientists or a large team with many different roles. Either way, it is important that your team members have an effective ecosystem where they can collaborate. This can involve collaboration on the cleaning of data, the development and testing of models, or on the deployment of these models into production.

With the shortage of data science resources in the market, some companies are starting to look outside the walls of their organizations for citizen data scientists -- individuals outside of the organization who can collaborate with your teams to perform analysis of data and create models. As the extent of your team boundaries grows, your requirements for a platform that enables that collaboration increase as well. Ensure that the platform you select can be used across those boundaries.

MLOps and Operationalization

Data science in the laboratory is important, but for the results of their work to be beneficial to your business in a sustainable and repeatable way, the data preprocessing and model deployment has to be operationalized. Creating models and deploying models to a production environment require different skills. Sometimes, you will have resources who span both disciplines, but as your team grows and becomes more complex, these resources will often be very different.

It is important that you assess the platform’s capabilities to facilitate the collaboration between the data scientists as well as the collaboration between the data scientists and MLOps, who have the responsibility for deployment and ongoing sustainability of these models.

Evaluate what mechanisms are in place in your platform to enable the models to be promoted from the development stage to production stage and what gates exists along the way to maintain system integrity.

Evaluate Your Platform

As you meet with potential vendors, make sure you know what your team needs to be successful and then use those criteria to evaluate the fit of the tool to the situation at hand. Using these five key areas of evaluation will provide you the basis for an effective set of conversations with your vendor. If you have the right tools on hand for your data scientists, hopefully you won’t find yourself like Wile E. Coyote -- getting burned in the end -- but rather capturing that elusive target: business value.

Author: Troy Hiltbrand

Source: TDWI