R and Python are incredibly good tools for data manipulation, data analysis and general development. They integrate extremely well with database technologies like SQL Server and can be used for creative applications like data visualisation. They are both popular languages with a wide following of contributors and programmers who use them on a daily basis.
But one question I see regularly is: which one should I learn? Which one is better? And how do they compare to other data analysis software like SAS?
The short answer is that you should probably learn both if you can. This article will go through some of the key differences and explain how to productionise your data applications with whatever language you choose.
As a language, R is one of the easiest to learn, and for this reason it is very popular in academic and statistics-based fields. It is also a popular choice among data scientists who use Windows machines, thanks to its ease of installation and its capability to develop machine learning and predictive models from data sets.
Some key facts:
- R is open source and free to use.
- One of the main tools for data manipulation is currently dplyr. It provides some performance and code readability improvements over the base libraries, so it is probably a package you should look into if you elect to learn some R code this year.
- It is single-threaded. Running R in a multi-threaded application is possible but difficult, so in a lot of cases its performance doesn’t quite compare to other languages like C, Ruby, or even Python.
How can you productionise R?
- You can now run R scripts inside SQL Server 2016.
- The inclusion of Microsoft R Server inside the SQL Server 2016 installation means you can run a massive range of scripts inside SQL stored procedures that can handle semi-structured data, connect to other tools like Hadoop and Spark, and even connect to external data sources like REST APIs.
- R is integrated into Azure ML. This means you can incorporate R scripts into your machine learning workflows and perform different types of data cleansing or data manipulation at different stages of the process.
It is worth noting that development cycles of R are very short, so new versions of modules and packages are being released constantly. The language itself is quality tested, due to the sheer number of contributors collaborating on the same projects, and because the language is open source you know how your solution works under the hood.
Similarly, Python is a toolbox of analysis and machine learning tools. It is only slightly more difficult to learn than R (though still fairly straightforward), and it is an excellent choice for all-purpose development as well as for data analysis and ML.
Some key facts:
- Python integrates well with a wide range of external tools. Hadoop and Spark both have APIs for running jobs with Python code, and external tools like Google’s Tensorflow generally come with APIs built for Python users.
- It is less Windows-friendly than R is. This is probably fine if you develop in a Linux environment or plan on using a Linux-based production environment.
- It is capable of multi-threading and outperforms R in a lot of respects. However, other technologies like Java, Scala and C seem to outperform Python on average.
- One of its main tools is pandas, which manipulates data frames in a similar way to R’s dplyr. It is more object-oriented than dplyr, which is more procedural in nature, and it performs well with large data sets.
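As a rough illustration of the pandas style mentioned in the last point, the sketch below adds a column, filters, groups and summarises a small data frame with chained method calls, in much the same spirit as a dplyr pipeline. The sales data and column names are invented for the example, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South", "East"],
        "units": [10, 15, 7, 3, 12],
        "price": [2.5, 2.5, 4.0, 4.0, 3.0],
    }
)

# Method chaining: add a column, filter rows, group, aggregate, then sort.
summary = (
    sales.assign(revenue=lambda df: df["units"] * df["price"])
    .query("units >= 5")
    .groupby("region", as_index=False)
    .agg(total_units=("units", "sum"), total_revenue=("revenue", "sum"))
    .sort_values("total_revenue", ascending=False)
    .reset_index(drop=True)
)

print(summary)
```

Each step returns a new data frame, so the whole transformation reads top to bottom as a single expression.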
How to productionise machine learning in Python?
- As with R, Python has libraries for a massive range of development cases. PySpark and MLlib can be used to train data models on Hadoop.
- You can use TextBlob and the Natural Language Toolkit to perform sentiment analysis, text analysis and other Natural Language Processing (NLP) tasks.
- You can create custom reports and visualise data with visualisation packages like plotly.
Like R, Python is free to use and its development cycles are short. You also benefit from the same quality control, and can trust that your code is doing exactly what you intend it to do.
Statistical Analysis System (SAS) is a collection of large-scale software solutions, developed by the SAS Institute. It requires a license and is a much more expensive option than an open-source solution. Some of its features include:
- Data is stored in tables, much like SQL databases.
- There is a GUI for non-technical users, while programmers can code directly.
- It is commonly used for information retrieval, report writing, developing applications, data warehousing, data mining and statistical analysis.
SAS has a distinct advantage over R for Big Data in that open-source R is single-threaded and performs all of its operations in memory, while SAS is more robust on this front. In R, raw data is copied in memory into objects that the user can manipulate, and so programmers need to be aware of their memory usage.
How does one productionise SAS? SAS has around 200 modules that you can select from and utilise for a specific business problem. Some modules can potentially contain hundreds of procedures to use.
However, SAS is a licensed product and is an expensive option compared to languages like R and Python, which are free to use. Development cycles are typically longer, but the product is still frequently used in corporate settings. Consolidata's preference is for the open-source solutions, but you may want to do further research to see if SAS is right for you.
Typical Use Cases
Building data pipelines with R, Python and SQL
All of these tools have their distinct advantages. Because of its power in data wrangling, data cleaning and data manipulation, Python becomes a good choice for the extraction and transformation stages of ETL. There will always be a requirement for the housing of structured data in relational database technologies like SQL Server, but more serious volumes of semi-structured data may require a NoSQL database where Python does the heavy lifting and provides an interface to programmers.
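To make the extract, transform and load stages concrete, here is a minimal sketch using only the Python standard library. The raw CSV text, the table name and the cleaning rules are all invented for illustration, and an in-memory SQLite database stands in for a relational store such as SQL Server.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a file or an API.
RAW_CSV = """customer,amount,country
 Alice ,120.50,uk
Bob,not_available,UK
carol,80,fr
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, standardise country codes, drop unparseable amounts."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        name = row["customer"].strip().title()
        country = row["country"].strip().upper()
        cleaned.append((name, amount, country))
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, country TEXT)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")  # stand-in for a SQL Server database
load(transform(extract(RAW_CSV)), connection)

stored = connection.execute(
    "SELECT customer, amount, country FROM sales ORDER BY customer"
).fetchall()
print(stored)
```

Keeping the three stages as separate functions means each one can be tested or swapped out on its own, for example replacing the SQLite connection with a driver for your production database.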
One potential workaround for R users who want to work with big data is Microsoft R Server, which uses specialist packages like RevoScaleR to manipulate data sets spanning millions of rows. RevoScaleR uses XDF files (external data frames) that are written to disk, and processes these larger data sets drastically faster than normal single-threaded R.
Data retrieval and building analysis views on top of your data warehouse can probably be left to SQL Server and SQL Server Analysis Services for now. Tabular models for data mining and SQL Server both perform fantastically well for getting valuable insights out of data.
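The general idea behind disk-backed data is to process rows in blocks rather than loading everything into memory at once. The sketch below shows that idea in plain Python; it is not RevoScaleR or the XDF format. The demonstration file, column name and chunk size are assumptions made for the example.

```python
import csv
import os
import tempfile
from itertools import islice


def chunked_mean(path, column, chunk_size=1000):
    """Compute the mean of one column while holding at most chunk_size rows in memory."""
    total = 0.0
    count = 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))  # read the next block of rows
            if not chunk:
                break
            total += sum(float(row[column]) for row in chunk)
            count += len(chunk)
    return total / count if count else float("nan")


# Build a small demonstration file (hypothetical data: the values 1 to 10000).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["value"])
    writer.writerows([i] for i in range(1, 10001))
    demo_path = tmp.name

mean_value = chunked_mean(demo_path, "value", chunk_size=500)
os.remove(demo_path)
print(mean_value)
```

Only running totals are kept between chunks, so memory use depends on the chunk size rather than on the size of the file.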
Both Python and R provide excellent capabilities for machine learning. If you are serious about using open-source tools for training and testing client data while benefiting from great performance, then Python is probably for you. However, Microsoft also provides excellent machine learning capabilities with Azure ML and Microsoft R Server.
Azure ML has a user-friendly interface where non-technical users can drag and drop specific modules into a workflow. They can load and clean data, apply training procedures, test their models, and feed the results back into their workload. R code can be integrated at various steps and output can be pushed into dashboard applications like Power BI.
For technical teams, Microsoft R Server has in-built functions in its libraries for linear regression and predictive analytics, providing some of the rudiments for you to build large-scale machine learning applications.
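For readers who have not met linear regression before, the sketch below fits a straight line by ordinary least squares on a training split and checks it against a held-out test split. It is written in plain Python for illustration and is not the Microsoft R Server API; the data is invented and deliberately noise-free so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) pair to a new x value."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical data generated from y = 3 + 2x, with no noise.
xs = list(range(10))
ys = [3 + 2 * x for x in xs]

# Hold out the last two points for testing.
train_x, test_x = xs[:8], xs[8:]
train_y, test_y = ys[:8], ys[8:]

model = fit_line(train_x, train_y)
errors = [abs(predict(model, x) - y) for x, y in zip(test_x, test_y)]
print(model, errors)
```

With real data the test errors would not be zero, and their size gives a first indication of how well the model generalises.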
If I had the chance to go back in time, I would probably have done things differently. I started by learning SQL (which I feel was a perfect start) and then swiftly moved on to R. The language was easy to learn and I got a bit of an introduction to OOP using it. However, given the choice again, I feel like I would have moved on to Python, due to its computational power and its omnipresence in development circles. I do not regret learning R, and as such I am in the process of learning both R and Python simultaneously.
The lesson is that if you are a business that wants to start becoming data-driven with open-source tools like R (or more advanced data platforms or premium analytics solutions) then your first steps will largely depend on the questions you want answering.