  • Data Science, Editor's Picks
  • Experfy Editor
  • MAR 30, 2014

Is Python Becoming the King of the Data Science Forest?

R has served as the de facto tool for big data analytics. According to RedMonk’s bi-annual rankings of the top 20 programming languages, as measured by activity on StackOverflow and in GitHub repositories, R is ranked #15 among all programming languages. This ranking is both surprising and impressive for a domain-specific language. Interestingly, Python sits at the top of the list, among the top dogs (Java, JavaScript and PHP) that are used for general-purpose web programming. Lesser-known languages such as Julia are also represented in the rankings, although not in the top-20 list. The plot for the first-quarter 2014 rankings is shown here.

Figure: RedMonk Language Rankings 2014

Despite R’s apparent success, MongoDB’s Matt Asay has argued that while R was once the language of choice for data scientists, it is quickly ceding ground to Python. One reason offered for the perceived decline in R’s popularity is its complex programming environment, which requires special training. According to Robert Muenchen at the University of Tennessee, R remains a tough language to master even for data scientists who possess expertise in statistical tools such as SAS, SPSS and Stata. This is largely because R uses misleading function and parameter names. SAS, SPSS and Stata use the “sort” command to sort data sets. R has a command of the same name, but it does not sort data sets; instead, it sorts individual variables. In R, one must use the “order” function to sort data sets, and even that happens in a rather convoluted manner. In addition, R suffers from sparse, non-standard output, and it has too many commands to master. R also offers only loose control over variables, and naming or renaming variables is an overly complex exercise, at least for the novice.
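To make the “sort” versus “order” distinction concrete, here is a minimal Python sketch of the idea. It is an illustration only: the toy data set and the `order` helper are hypothetical stand-ins written for this example, not R's actual implementation. In R, the equivalent idiom is usually written along the lines of `df[order(df$age), ]`.

```python
# Hypothetical toy data set: each column is a list, and rows align by position.
data = {
    "name": ["Carol", "Alice", "Bob"],
    "age": [35, 29, 41],
}


def order(values):
    """Return the indices that would sort `values`.

    Illustrative stand-in for R's order(); note these indices are
    0-based, whereas R's are 1-based.
    """
    return sorted(range(len(values)), key=lambda i: values[i])


# Sorting a single variable: only this one column is rearranged,
# so the link to the other columns is lost.
sorted_ages = sorted(data["age"])

# Sorting the whole data set: compute the ordering once,
# then apply the same ordering to every column.
idx = order(data["age"])
sorted_data = {col: [vals[i] for i in idx] for col, vals in data.items()}

print(sorted_ages)
print(sorted_data)
```

The two-step pattern (compute an ordering, then index every column with it) is what newcomers from SAS, SPSS or Stata tend to find roundabout, since those tools sort the data set in a single command.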

Python, on the other hand, is much easier to master, even though it may still be harder than other programming languages used to develop web applications. The fact that Python is already used to develop web applications is what makes it an attractive choice for data science. If you are struggling to find qualified data scientists, why not train your existing Python developers to work in your data science teams? Furthermore, given the wide applicability of the language, we are witnessing what Tal Yarkoni of UT Austin calls the “Pythonification” of tools that are appropriate for data science:

The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which, is a pretty remarkable confession considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, web application development, were only just starting to emerge.

These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.

While there is little doubt that Python is going to become a dominant language for data scientists, how is it faring against other languages of the web?  The chart below provides some insights.

Figure: Job trends graph for Python, Java, PHP, JavaScript and R

The growing popularity of Python is not surprising given its versatility. To be sure, R is still far more powerful when it comes to data analytics. Python is catching up, but does this really mean that its large following is going to supplant R? The chart above needs to be read with care because it compares apples and oranges: Python's numbers include its many uses outside data science. Charts like these are often used to make misguided arguments about R’s impending demise. So, how does demand for R compare with other statistical tools such as SAS?

Figure: Job trends graph for R statistics and SAS statistics

This comparison gives a more nuanced picture. While Python has significant traction, in part because of its use in domains other than data science, demand for R is also on the rise, and R is not going to become obsolete anytime soon. R also continues to enjoy popularity among academics.

We would love to hear how you are staffing your current teams and what role R and Python play in your environment.

See a follow-up post on this topic: Can Python Replace R for Developing Predictive Models?

