Why Python?

Why are you recommending Python?  That’s the question a colleague of mine asked when I was pitching Python for data science work.  It is a fair question, and I tried to answer with facts rather than opinions.  Answering a question about why one language is better than others can quickly turn into a religious war, so let me try to avoid that with some disclaimers.

First of all, I don’t think one size fits all: Python is not going to become THE programming language.  Depending on the task, other languages are a much better fit.  For instance, Java suits enterprise applications that solve well-defined problems.  Fortran, C, and C++ are great for HPC.  C is dominant for systems programming.  JavaScript with Node.js, or PHP, are de facto standards for building web sites.  I could go on forever, as many languages fit a niche.  But when it comes to data science, Python has taken the lead.  Let’s look at facts before you start arguing with me.

Facts

I am not the only one saying Python has the lead.  Here is a first fact supporting this: the job trends for data science related topics on indeed.com.

[Figure: indeed.com job trends for Python, R, and Scala in data science related job postings]

These job trends are for: Python and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”), Scala and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”), R and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”).
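The three queries share the same topic clause and differ only in the language name, so they can be generated rather than typed by hand.  Here is a minimal sketch; the function name build_query is made up for this example, and the output simply reproduces the query text listed above (the exact syntax accepted by the indeed.com search box is an assumption).

```python
# Topic terms taken from the queries listed in the post
TOPICS = [
    "data science",
    "big data",
    "statistical analysis",
    "data mining",
    "machine learning",
]


def build_query(language, topics=TOPICS):
    """Return '<language> and ("topic 1" or "topic 2" or ...)'."""
    quoted = " or ".join(f'"{topic}"' for topic in topics)
    return f"{language} and ({quoted})"


# One query per language compared in the post
for language in ("Python", "Scala", "R"):
    print(build_query(language))
```

Keeping the topic list in one place makes sure the three languages are compared on exactly the same set of job postings.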

I selected R, Python, and Scala for this comparison because they are the most popular open source languages for data science.  R has long been the dominant open source language for statisticians, and by extension, for data science.  But we see that Python has been taking over for the past couple of years.  Scala is a recent contender because of its link to Spark and Spark ML, but it is still a quite distant follower.

What about commercial software?  I do think that SPSS Modeler, for instance, is here to stay as well.  But its target is a bit different from R, Python, or Scala.  Indeed, SPSS Modeler is point-and-click software aimed at non-programmers.  With SPSS Modeler one draws the machine learning pipeline, whereas one programs it in Python, R, or Scala.  It is because of this difference that I did not include SPSS Modeler in the comparison, as it would be comparing apples to oranges.

Back to open source, here are other signs of Python’s popularity.  The table below includes the number of questions on Stack Overflow, the number of packages in the main package repository for the language, and the programming community index on tiobe.com.  For Scala, to be fair, one should count all Java libraries.  I did not find a simple way to evaluate their number, hence I left it uncounted.

                                    Python            R               Scala
Stack Overflow questions            527,550           122,907         46,580
Packages                            73,402 on PyPI    7,798 on CRAN   Java libraries
TIOBE index (lower is better)       5                 19              30

These measure the strength and popularity of the ecosystems built around these languages.  Indeed, when comparing languages, one should not just do a feature by feature comparison, or efficiency benchmarks.  Having a vibrant community that can help newcomers, and that can further advance the language, is key.
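To put the table’s figures in proportion, the short sketch below computes how many times larger the Python numbers are than the others.  The counts are copied from the table above; the helper name times_larger is made up for this example, and Scala is left out of the package comparison because its Java libraries were not counted.

```python
# Counts copied from the table above
stack_overflow = {"Python": 527_550, "R": 122_907, "Scala": 46_580}
packages = {"Python": 73_402, "R": 7_798}  # Scala omitted: Java libraries not counted


def times_larger(baseline, counts):
    """For each other language, how many times larger the baseline count is."""
    base = counts[baseline]
    return {
        name: round(base / value, 1)
        for name, value in counts.items()
        if name != baseline
    }


print(times_larger("Python", stack_overflow))  # {'R': 4.3, 'Scala': 11.3}
print(times_larger("Python", packages))        # {'R': 9.4}
```

These ratios say nothing about quality, only about the size of each community and its package repository at the time the numbers were collected.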

There are probably additional ways to evaluate the importance of an ecosystem, and I welcome suggestions.

We can also get facts about the main IDEs data scientists use with these languages: IPython/Jupyter for Python notebooks, RStudio for R scripts, and Apache Zeppelin for Scala notebooks.  I look at the number of Stack Overflow questions, at the number of GitHub repositories using these tools, then at the stars, forks, commits, and contributors for the main GitHub repository of Jupyter/IPython, RStudio, and Zeppelin.  Last, I indicate the number of programming languages supported by each IDE.

                                    Jupyter/IPython   RStudio   Zeppelin
Stack Overflow questions            18,296            8,333     365
GitHub repos                        5,530             858       269
GitHub stars                        8,593             1,220     1,122
GitHub forks                        2,528             288       532
GitHub commits                      21,351            14,665    2,030
GitHub contributors                 434               34        95
Languages supported (kernels)       62                1         4

We see that here too IPython/Jupyter has a lead, but that RStudio is quite popular too.  Zeppelin, being very recent, is not very popular yet, but it is actively being developed.  Popularity differences show on Google Trends too:

[Figure: Google Trends search interest for IPython, Jupyter, RStudio, and Zeppelin]

There we see that the renaming of IPython to Jupyter isn’t fully recognized, as IPython search popularity is still ahead of Jupyter even if it is stalling.  We also see that RStudio is very popular too.  And we see that Zeppelin is only getting started.

The above facts support the view that Python is the leading open source language for data scientists.  They also support the view that R is a bit less popular, but growing too.

Enough of facts, let’s move into opinions to make our final decision.  You can stop reading here if you have strong opinions on Python, R, or Scala as you may disagree with me.

Opinions

You’ve been warned, so here it is.  I am clearly biased towards delivering commercial software that uses data science.  For this use case, Python is a better fit than R.  In a nutshell, Python becomes way better than R when it comes to turning data science into production at scale.  R may still be better in the exploration phase of a science project.

Let me hint at some reasons why I believe the above is true.  Each of these is probably worth a post on its own, but let me list them here as a starting point.

I don’t think Python is better than R for data analysis per se.  In that respect, R is more comprehensive, as nearly all the statistical techniques you can think of exist in R.  And lots of machine learning too.  The Python ecosystem is quickly catching up with its scientific stack and packages such as sklearn, pandas, statsmodels, matplotlib, seaborn, etc. (I can’t cite them all), but it is not as comprehensive as the R ecosystem yet.

So, why Python and not R?  Because Python can be used beyond data analysis.  You can build web sites in Python, you can connect to almost any data source, you can leverage an incredible number of systems and tools as many of them expose a Python API, you can visualize results, you can implement arbitrary algorithms, you can compile it to get good performance, etc.  Python machine learning packages are also more efficient on average than their R counterparts.  Add to that the ease of learning and using Python, and you get an environment where people can very quickly demonstrate, then productise, new ideas and new techniques.  Last, but not least, Python is taking the lead over R in some machine learning areas.  For instance, a recent deep learning library such as TensorFlow exposes a Python API, but not an R API.  The Python Spark API is also more comprehensive than the R API.  And if you really need some R package, then you can call R from Python!
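As a small illustration of staying in one language from analysis to something a web endpoint could call, here is a minimal sketch that uses only the standard library.  The toy data and the names fit_line and make_handler are made up for this example; a real project would more likely rely on packages such as sklearn and on a web framework.

```python
import json
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def make_handler(slope, intercept):
    """Wrap the fitted model in a JSON-in, JSON-out function."""
    def handle(request_body):
        x = float(json.loads(request_body)["x"])
        return json.dumps({"x": x, "prediction": slope * x + intercept})
    return handle


# Toy data, made up for illustration: the points lie exactly on y = 2x + 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
handle = make_handler(slope, intercept)

print(handle('{"x": 4}'))
```

The same handle function could be plugged into any Python web server, which is the kind of continuity between exploration and production that the paragraph above argues for.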

Another reason for me to select Python is that R comes with a GPL license.  As I understand it, this license requires you to open source any software you distribute that includes R.  Therefore, using R is restricted either to those who do not care about shipping software, or to those building open source software.  The R license is just not a fit for those developing commercial software, like me.

Conclusion

I hope I answered the initial question correctly: why Python?  The answer in general is all about the ecosystem and the breadth of what it covers.  And when it comes to commercial software development, the R license is the straw that broke the camel’s back.

It does not mean that Python should be the only language we use when building solutions.  But we certainly leverage Python more than other programming languages for the data science pieces.
