How Processing Speed Defines Who Is Successful in Data Science
For close to two decades, companies like Facebook, Amazon, JP Morgan and Uber have been writing the book on how to successfully use data science to grow their businesses. Thanks to these innovators and others, it has become a competitive requirement to quickly extract actionable insights from rapidly changing data.
Businesses can use AI to learn from massive amounts of data captured from a broad range of sensors and sources, but none of this knowledge can be gained without processing these volumes of data.
Trouble is, building an end-to-end data science practice is easier said than done. Even as companies compete to hire the best data scientists, they struggle to get full value from that investment, both because data science roles are poorly differentiated and because legacy CPU architectures, tools and software create bottlenecks when applied to unprecedented volumes of data. New AI data types, such as the audio and video used in computer vision and conversational AI, are difficult to integrate into legacy systems.
At NVIDIA, we see the data engineer’s role as ingesting unstructured, noisy data and cleaning it up for data scientists, who explore and experiment as they build models and analyze patterns. The machine learning engineer, in turn, architects the entire end-to-end machine learning and deep learning process.
At any point along this data science lifecycle — using Jupyter Notebooks, running Apache Spark or SQL Server ETL (extraction, transformation and loading) — slow, CPU-based computing can stand in the way of analyzing ever-growing datasets quickly enough to be of value to the business. In fact, a 2020 survey found that more than half of data science professionals have trouble showing the impact data science has on business outcomes.
“Getting data science outputs into production, where they can impact a business, isn’t always straightforward,” data science software firm Anaconda said in the report that surveyed 2,360 people globally.
If anything, that understates the problem. It’s rarely straightforward.
A modern data science team faces the challenge of working collaboratively with CIOs, CTOs and business units to create an end-to-end lifecycle for extracting actionable insights from data. Leading cloud service providers and startups alike are moving to meet this goal by offering accelerated computing platforms and software that speed analytics and data processing.
“Getting data science outputs into production will become increasingly important, requiring leaders and data scientists alike to remove barriers to deployment and data scientists to learn to communicate the value of their work,” the Anaconda report recommends.
It Takes a Village: Speeding Full-Stack Data Science
Some organizations will speed the investment return on AI by building centralized, shared infrastructure at supercomputing scale. Others choose a hybrid approach, blending cloud and data center infrastructure. All are working to develop and scale data science talent, share best practices and accelerate the solving of complex AI problems.
NVIDIA is working with all leading cloud service providers and server manufacturers to help companies transform and analyze complex data sets and use machine learning to automate analysis. Many of these collaborations are based on accelerated computing platforms that combine both hardware and software to speed data science.
Key to much of this work is RAPIDS, a suite of open-source software libraries and APIs for running end-to-end data science and analytics pipelines entirely on NVIDIA GPUs. Walmart is one of the innovators actively contributing to the platform and deploying RAPIDS internally. The global supercenter leader is using AI to improve everything from customer experience to stocking to pricing.
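As a rough illustration of what an end-to-end pipeline on the GPU can look like, the sketch below loads and prepares data with RAPIDS cuDF and fits a model with cuML. The file name and column names are hypothetical placeholders for illustration, not details drawn from the article.

```python
# A minimal sketch of a GPU-resident pipeline with RAPIDS cuDF and cuML.
# The file path and column names ("price", "units_sold", "store_id") are
# hypothetical; any tabular dataset with numeric features would do.
import cudf
from cuml.linear_model import LinearRegression

# ETL: read and clean the data entirely on the GPU
df = cudf.read_csv("sales.csv")
df = df.dropna(subset=["price", "units_sold"])
df["revenue"] = df["price"] * df["units_sold"]

# Feature preparation: aggregate per store, still without leaving the GPU
features = df.groupby("store_id").agg({"price": "mean", "units_sold": "sum"})

# Model training: cuML mirrors the familiar scikit-learn API
X = features[["price"]]
y = features["units_sold"]
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
```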
By hiding the complexities of working with the GPU and the behind-the-scenes communication protocols within the data center architecture, RAPIDS creates a simple way to get data science done. As more data scientists use Dask, a flexible library for parallel computing in Python, providing acceleration without requiring code changes is essential to shortening development time.
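To illustrate that "acceleration without code change" point, here is a minimal sketch of the same kind of dataframe work scaled out across GPUs with Dask and dask_cudf. The cluster setup, CSV file pattern and column names are assumptions for illustration; the RAPIDS and Dask documentation describe the supported configurations.

```python
# A minimal sketch of scaling dataframe work across GPUs with Dask.
# The CSV glob and column names are hypothetical placeholders.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# Spin up one Dask worker per local GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# The dataframe API mirrors pandas/cuDF, so existing code needs few changes
ddf = dask_cudf.read_csv("sales-*.csv")
per_store = ddf.groupby("store_id")["units_sold"].sum()

# Work is lazy until compute() triggers execution across the GPUs
result = per_store.compute()
print(result.head())
```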
Accelerated Data Science Speeds Business Success
Few successful enterprises operate without a finance, HR or marketing team. Accelerated data science is becoming an equally critical function as enterprises realize that their data is the key to winning more customers. Those who have yet to add data science expertise to their business are now operating in the dark, while competitors are already using data science to bring new opportunities to light.
In every industry, data scientists are eager to put their company’s most valuable assets to work. From data engineering to deploying AI models in production, accelerated data science is giving enterprises the speed needed to test more ideas, find more answers and drive success.
Author: Scott McClellan
Source: Insidebigdata