The Language of Data Science: Python vs R
Python may be the second choice to R, but its popularity and ease of use position it to dominate data science.
“When [Netflix’s data science team] started, there was one single kind of data scientist,” says Christine Doig, director of innovation for personalized experiences at Netflix. “Now the role has been integrated into the organization.” This isn’t just a Netflix thing. Across all industries, enterprises are embracing data science to craft personalized, engaging experiences, optimize pricing, and more. As they do so, they’re expanding the use of data science into product management, marketing, and other areas.
This is why the language that organizations use to decipher their data will increasingly be Python, not R. As organizations look to a more diverse group to help with data science, Python’s mass appeal makes for an easy on-ramp.
R or Python?
Historically, if you wanted to do data science, you needed to know R. As detailed on the R project’s site, “R is an integrated suite of software facilities for data manipulation, calculation, and graphical display.” It’s not really a programming language, per se, but includes one. Originally built for statistical and numerical analysis, R has stayed true to those roots and remains an excellent tool, particularly for statisticians in their role as data scientists. This strength can also be a weakness, given the spread of data science well beyond the area of statistical analysis.
It’s true, as Sheetal Kalburgi, associate product manager at Anaconda, points out, that “data scientists are more technical and statistical” and often are “responsible for tasks like developing complex statistical algorithms that communicate product performance, predict outcomes, design experiments such as A/B testing, and optimize computational operations, to name a few.” But they also tend to be well versed in programming: your average data scientist is much more likely to have a programming background than a hard-core statistics background.
Even if a company’s business problem centers on statistics, Python will often still prove superior, if only because of familiarity. As Van Lindberg, general counsel for the Python Software Foundation, told me, “Python is the second-best language for everything. R may be the best for stats, but Python is the second … and the second-best for [machine learning], web services, shell tools, and (insert use case here). If you want to do more than just stats, then Python’s breadth is an overwhelming win.”
No one really wants the silver medal instead of gold, but in this case, second place means Python will make itself useful for a much broader array of use cases. As Peter Wang, CEO of Anaconda, said in an interview, “Python had a broader scope from the beginning.” Engineering and science DNA is “baked into the Python core.” It’s therefore going to be the right answer much more often than R.
Python swallows data science
That’s not a criticism of R so much as a recognition of the momentum and mass Python has going for it. According to a recent SlashData survey of more than 20,000 developers, Python is a developer darling, coming in second only to JavaScript in terms of popularity. Part of this stems from the huge community around Python that extends Python’s utility into all sorts of domains (deep learning, artificial intelligence, and more) while fine-tuning it in key areas to improve performance. It’s increasingly difficult to find any areas where Python isn’t pushing to be the first-choice option, not merely “second best,” to use Lindberg’s phrasing.
Part of Python’s popularity stems simply from how easy it is to use. Given that enterprises are desperately trying to find data science talent, the easiest path is to mint new data scientists from existing employees. Even those without an engineering background find it easy to embrace Python’s simple syntax and readability and appreciate how useful it is for quick prototyping.
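To give a sense of that readability, here is a minimal sketch of a quick prototype that uses only Python’s standard library. The conversion figures and the names `control`, `variant`, and `relative_lift` are made up for illustration; they do not come from any of the people or companies quoted here.

```python
from statistics import mean

# Hypothetical daily conversion rates (percent) for two page designs.
control = [2.0, 2.5, 3.0, 2.5]
variant = [3.0, 3.5, 3.0, 2.5]


def relative_lift(baseline, candidate):
    """Return the relative change in the mean from baseline to candidate."""
    base_mean = mean(baseline)
    return (mean(candidate) - base_mean) / base_mean


if __name__ == "__main__":
    lift = relative_lift(control, variant)
    print(f"Control mean: {mean(control):.2f}%")
    print(f"Variant mean: {mean(variant):.2f}%")
    print(f"Relative lift: {lift:.1%}")
```

A real A/B analysis would also need a significance test; this only shows how little ceremony a first pass requires.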
Lately, Python has become even easier to use: Anaconda released PyScript, which makes Python more accessible to front-end developers by letting them write Python in HTML to build web applications. This is just one more in a long string of innovations from the Python community that expand the breadth and depth of what developers and data scientists can do with Python.
Those innovations, and the Python community that benefits from them, increasingly make the decision to use Python that much easier. For areas where R or another alternative might be first choice, Wang suggests Python’s history as a great glue language means that “maybe someone will build a nice Python wrapper to expose a thin shim to expose some R capabilities” or otherwise make it easy for a data scientist to build with Python while adding complements from other communities, like R.
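As a rough illustration of the “glue language” idea, the sketch below shells out from Python to R’s `Rscript` command-line tool. It is not the wrapper Wang has in mind, only one simple way such a shim could look. It assumes R is installed with `Rscript` on the PATH, and the helper names `build_r_command` and `run_r` are hypothetical.

```python
import shutil
import subprocess


def build_r_command(expression):
    """Build the argument list that asks Rscript to evaluate one R expression."""
    return ["Rscript", "-e", expression]


def run_r(expression):
    """Evaluate an R expression in a subprocess and return its printed output.

    Assumes R is installed and `Rscript` is on the PATH.
    """
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found; install R to use this shim")
    result = subprocess.run(
        build_r_command(expression),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as str rather than bytes
        check=True,           # raise CalledProcessError if R exits with an error
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # R's built-in median() does the statistics; Python only glues it in.
    print(run_r("cat(median(c(3, 1, 4, 1, 5)))"))
```

Starting a new R process for every call is slow, so this approach suits occasional calls rather than tight loops.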
All this helps explain why Python looks set to help drive the next decade of data science, given how robust it is for experienced data scientists and less-experienced aspirants.
Author: Matt Asay
Source: InfoWorld