  • How the skillset of data scientists will change over the next decade

    AutoML is poised to turn developers into data scientists — and vice versa. Here’s how AutoML will radically change data science for the better.

    In the coming decade, the data scientist role as we know it will look very different from the way it does today. But don’t worry: no one is predicting lost jobs, just changed ones.

    Data scientists will be fine: according to the Bureau of Labor Statistics, the role is still projected to grow at a higher-than-average clip through 2029. But advancements in technology will be the impetus for a huge shift in a data scientist’s responsibilities and in the way businesses approach analytics as a whole. And AutoML tools, which help automate the machine learning pipeline from raw data to a usable model, will lead this revolution.

    In 10 years, data scientists will have entirely different sets of skills and tools, but their function will remain the same: to serve as confident and competent technology guides who can make sense of complex data to solve business problems.

    AutoML democratizes data science

    Until recently, machine learning algorithms and processes were almost exclusively the domain of more traditional data science roles—those with formal education and advanced degrees, or working for large technology corporations. Data scientists have played an invaluable role in every part of the machine learning development spectrum. But in time, their role will become more collaborative and strategic. With tools like AutoML to automate some of their more academic skills, data scientists can focus on guiding organizations toward solutions to business problems via data.

    In many ways, this is because AutoML democratizes the effort of putting machine learning into practice. Vendors from startups to cloud hyperscalers have launched solutions easy enough for developers to use and experiment on without a large educational or experiential barrier to entry. Similarly, some AutoML applications are intuitive and simple enough that non-technical workers can try their hands at creating solutions to problems in their own departments—creating a “citizen data scientist” of sorts within organizations.

    In order to explore the possibilities these types of tools unlock for both developers and data scientists, we first have to understand the current state of data science as it relates to machine learning development. It’s easiest to understand when placed on a maturity scale.

    Out-of-the-box machine learning applications

    Smaller organizations and businesses with more traditional roles in charge of digital transformation (i.e., not classically trained data scientists) typically fall on the low end of this scale. Right now, they are the biggest customers for out-of-the-box machine learning applications, which are geared more toward an audience unfamiliar with the intricacies of machine learning.

    • Pros: These turnkey applications tend to be easy to implement, and relatively cheap and easy to deploy. For smaller companies with a very specific process to automate or improve, there are likely several viable options on the market. The low barrier to entry makes these applications perfect for data scientists wading into machine learning for the first time. Because some of the applications are so intuitive, they even allow non-technical employees a chance to experiment with automation and advanced data capabilities—potentially introducing a valuable sandbox into an organization.
    • Cons: This class of machine learning applications is notoriously inflexible. While they can be easy to implement, they aren’t easily customized, so the accuracy required for some use cases may be out of reach. Additionally, these applications can be severely limited by their reliance on pretrained models and data.

    Examples of these applications include Amazon Comprehend, Amazon Lex, and Amazon Forecast from Amazon Web Services, and Azure Speech Services and Azure Language Understanding (LUIS) from Microsoft Azure. These tools are often sufficient for burgeoning data scientists to take their first steps in machine learning and usher their organizations further down the maturity spectrum.

    Customizable solutions with AutoML

    Organizations with large yet relatively common data sets—think customer transaction data or marketing email metrics—need more flexibility when using machine learning to solve problems. Enter AutoML. AutoML takes the steps of a manual machine learning workflow (data discovery, exploratory data analysis, hyperparameter tuning, etc.) and condenses them into a configurable stack.

    • Pros: AutoML applications allow more experiments to be run on data in a larger space. But the real superpower of AutoML is the accessibility — custom configurations can be built and inputs can be refined relatively easily. What’s more, AutoML isn’t made exclusively with data scientists as an audience. Developers can also easily tinker within the sandbox to bring machine learning elements into their own products or projects.
    • Cons: While AutoML comes close, its limitations mean output accuracy can be difficult to perfect. Because of this, degree-holding, card-carrying data scientists often look down on applications built with the help of AutoML, even if the result is accurate enough to solve the problem at hand.

    Examples of these applications include Amazon SageMaker Autopilot and Google Cloud AutoML. Data scientists a decade from now will undoubtedly need to be familiar with tools like these. Like a developer who is proficient in multiple programming languages, data scientists will need proficiency with multiple AutoML environments to be considered top talent.
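    To make the "configurable stack" idea concrete, here is a minimal sketch using scikit-learn's GridSearchCV as a stand-in for what AutoML tools automate at far larger scale. The toy dataset and tiny search space are illustrative assumptions, not representative of any vendor's product:

```python
# Sketch of what AutoML automates: instead of tuning each step by hand,
# the search space is declared once and explored automatically with
# cross-validation. Real AutoML services also automate feature
# engineering and model selection across many algorithm families.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "stack": preprocessing and model bundled into one configurable unit.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The configuration: candidate hyperparameters, declared rather than
# tuned manually one run at a time.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)
print("best C:", search.best_params_["clf__C"])
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```

    The point of the sketch is the division of labor: the practitioner specifies the search space and the selection criterion, and the tooling runs the experiments.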

    “Hand-rolled” and homegrown machine learning solutions 

    The largest enterprise-scale businesses and Fortune 500 companies are where most of the advanced and proprietary machine learning applications are currently being developed. Data scientists at these organizations are part of large teams perfecting machine learning algorithms using troves of historical company data, and building these applications from the ground up. Custom applications like these are only possible with considerable resources and talent, which is why the payoff and risks are so great.

    • Pros: Like any application built from scratch, custom machine learning is “state-of-the-art” and is built based on a deep understanding of the problem at hand. It’s also more accurate — if only by small margins — than AutoML and out-of-the-box machine learning solutions.
    • Cons: Getting a custom machine learning application to reach certain accuracy thresholds can be extremely difficult, and often requires heavy lifting by teams of data scientists. Additionally, custom machine learning options are the most time-consuming and most expensive to develop.

    An example of a hand-rolled machine learning solution is starting with a blank Jupyter notebook, manually importing data, and then conducting each step from exploratory data analysis through model tuning by hand. This is often achieved by writing custom code using open source machine learning frameworks such as Scikit-learn, TensorFlow, PyTorch, and many others. This approach requires a high degree of both experience and intuition, but can produce results that often outperform both turnkey machine learning services and AutoML.
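    The hand-rolled workflow described above can be condensed into a short sketch. It assumes scikit-learn and a built-in toy dataset, and the hyperparameter candidates are arbitrary choices a practitioner would normally pick from experience:

```python
# Condensed "hand-rolled" workflow: every step -- inspection, splitting,
# preprocessing, tuning, evaluation -- is written out explicitly rather
# than delegated to a turnkey service or an AutoML tool.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Import data and do lightweight exploratory analysis.
X, y = load_wine(return_X_y=True)
print("shape:", X.shape, "classes:", np.unique(y))

# 2. Split before any fitting so no test data leaks into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 3. Preprocess: fit the scaler on the training set only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 4. Manual hyperparameter tuning: the practitioner chooses both the
# candidates and the selection criterion.
best_C, best_score = None, -1.0
for C in (0.1, 1.0, 10.0):
    score = cross_val_score(SVC(C=C), X_train_s, y_train, cv=5).mean()
    if score > best_score:
        best_C, best_score = C, score

# 5. Final fit and held-out evaluation.
model = SVC(C=best_C).fit(X_train_s, y_train)
print("chosen C:", best_C,
      "test accuracy:", round(model.score(X_test_s, y_test), 3))
```

    Each numbered step is a decision point where experience and intuition pay off, which is exactly the overhead that turnkey and AutoML approaches trade away.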

    Tools like AutoML will shift data science roles and responsibilities over the next 10 years. AutoML takes the burden of developing machine learning from scratch off data scientists and instead puts the possibilities of machine learning technology directly in the hands of other problem solvers. With time freed up to focus on what they know best, the data and the inputs themselves, data scientists a decade from now will serve as even more valuable guides for their organizations.

    Author: Eric Miller

    Source: InfoWorld

  • Should we care more about ethics in a data science environment?

    The big idea

    Undergraduate training for data scientists, a role dubbed the sexiest job of the 21st century by Harvard Business Review, falls short in preparing students for the ethical use of data science, our new study found.

    Data science lies at the nexus of statistics and computer science applied to a particular field such as astronomy, linguistics, medicine, psychology or sociology. The idea behind this data crunching is to use big data to address otherwise unsolvable problems, such as how health care providers can create personalized medicine based on a patient’s genes and how businesses can make purchase predictions based on customers’ behavior.

    The U.S. Bureau of Labor Statistics projects a 15% growth in data science careers over the period of 2019-2029, corresponding with an increased demand for data science training. Universities and colleges have responded to the demand by creating new programs or revamping existing ones. The number of undergraduate data science programs in the U.S. jumped from 13 in 2014 to at least 50 as of September 2020.

    As educators and practitioners in data science, we were prompted by the growth in programs to investigate what is covered, and what is not covered, in data science undergraduate education.

    In our study, we compared undergraduate data science curricula with the expectations for undergraduate data science training put forth by the National Academies of Sciences, Engineering and Medicine. Those expectations include training in ethics. We found most programs dedicated considerable coursework to mathematics, statistics and computer science, but little training in ethical considerations such as privacy and systemic bias. Only 50% of the degree programs we investigated required any coursework in ethics.

    Why it matters

    As with any powerful tool, the responsible application of data science requires training in how to use data science and to understand its impacts. Our results align with prior work that found little attention is paid to ethics in data science degree programs. This suggests that undergraduate data science degree programs may produce a workforce without the training and judgment to apply data science methods responsibly.

    It isn’t hard to find examples of irresponsible use of data science. For instance, policing models that have a built-in data bias can lead to an elevated police presence in historically over-policed neighborhoods. In another example, algorithms used by the U.S. health care system are biased in a way that causes Black patients to receive less care than white patients with similar needs.

    We believe explicit training in ethical practices would better prepare a socially responsible data science workforce.

    What still isn’t known

    While data science is a relatively new field – still being defined as a discipline – guidelines exist for training undergraduate students in data science. These guidelines prompt the question: How much training can we expect in an undergraduate degree?

    The National Academies recommend training in 10 areas, including ethical problem solving, communication and data management.

    Our work focused on undergraduate data science degrees at schools classified as R1, meaning they engage in high levels of research activity. Further research could examine the amount of training and preparation in various aspects of data science at the master’s and Ph.D. levels, and the nature of undergraduate data science training at schools of different research levels.

    Given that many data science programs are new, there is considerable opportunity to compare the training that students receive with the expectations of employers.

    What’s next

    We plan to expand on our findings by investigating the pressures that might be driving curriculum development for degrees in other disciplines that are seeing similar job market growth.

    Source: The Conversation
