R and Hadoop on the Data Platform


There is a lot to get excited about with the R language nowadays. 

In the keynote speech at SQL Bits this year, Joseph Sirosh of Microsoft excitedly showcased some of the new features in SQL Server 2016, including the ability to run external R scripts and load libraries to perform tasks like machine learning and analysis. This new feature of SQL Server is creeping its way into other Microsoft technologies like Azure – in fact, Microsoft have become quite fond of R since they acquired Revolution Analytics, whose Revolution R distribution is now known as Microsoft R.

R is hot right now among data scientists and developers, but what does it add to relational database technologies like SQL? And how does Consolidata use R inside its data platform?


R is an open-source language with thousands of contributors around the world. It provides extra flexibility for analysts and data platforms, with readily available and well-documented libraries for cleaning data, reshaping it and loading it into a structured format for other uses. It also allows you to handle different data types like semi-structured data (JSON) and unstructured data, as opposed to SQL, which focuses on storing structured data in relational tables.

R also lets you plot data and build reporting solutions, and it is highly portable – in fact, you can do pretty much anything with it. You can learn more about R on the R Project website.



So how does R add value?

Data Enrichment

R allows you to make calls to APIs and pull publicly available data from sites like Google, LinkedIn, Facebook, Twitter, Salesforce, Eventbrite, Blogger and thousands of others. By creating a process for collecting this data, storing it, analyzing it and merging it with your existing data sources, you can start to assess causality in your existing data sets and build more predictive models based on the new data coming in.
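
As a minimal sketch of what this looks like in practice, assuming the httr and jsonlite packages are installed (the endpoint URL, the existing_data data frame and the event_id join key below are all placeholders):

  library(httr)
  library(jsonlite)

  # Call a public API and parse the JSON response
  resp <- GET("https://api.example.com/v1/events", query = list(city = "London"))
  stop_for_status(resp)                                    # fail early on HTTP errors
  events <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

  # Merge the new data onto an existing data set by a shared key
  # ('existing_data' and 'event_id' are hypothetical)
  enriched <- merge(existing_data, events, by = "event_id")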



Machine Learning

R is just one of the statisticians’ weapons of choice when it comes to building machine learning models, training them on data sets and then testing their predictive power on new data. It is a staple of the data scientist’s toolkit and appears frequently in Kaggle competitions, where teams compete to win prizes by building the best machine learning models from the available data sets.
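
Here is a small sketch of that train-and-test workflow in base R, using the built-in mtcars data set (the variables chosen are purely illustrative):

  # Split the data into training and test sets
  set.seed(42)
  train_idx <- sample(nrow(mtcars), size = 0.7 * nrow(mtcars))
  train <- mtcars[train_idx, ]
  test  <- mtcars[-train_idx, ]

  # Train a logistic regression predicting transmission type (am)
  model <- glm(am ~ mpg + wt, data = train, family = binomial)

  # Test its predictive power on data the model has not seen
  preds <- predict(model, newdata = test, type = "response")
  mean((preds > 0.5) == test$am)            # proportion of correct predictions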


R also plugs nicely into the Azure Machine Learning libraries, allowing you to perform any number of actions on your data before passing it into a training model to make predictions. Contact us to discuss a predictive analytics solution for your company.

Data Profiling

Detect data types as data arrives on the platform, interrogate it, and process it appropriately. Quickly summarize the data that lands, visualize it with packages like ggplot2 and plotly, and, if suitable, pipe it to another process.
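
A quick sketch of that first look at a newly landed file, using base R plus ggplot2 (the file path and column name are placeholders):

  # Read a newly landed file and profile it
  df <- read.csv("/landing/new_file.csv", stringsAsFactors = FALSE)
  str(df)       # detected column types
  summary(df)   # per-column summaries

  # Visualize a column of interest ('value' is hypothetical)
  library(ggplot2)
  ggplot(df, aes(x = value)) + geom_histogram()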

[Image: profvis profiling example]

This can all be done in R, and with wrappers like SparkR that distribute the processing to profile much larger files. For example, the image above was created by the profiling tool profvis, run over a piece of plain R code that counts the letters in War and Peace.
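
As a sketch, a letter-counting script like the one profiled above might be wrapped in profvis like this (the file path is a placeholder, and the profvis package is assumed to be installed):

  library(profvis)

  profvis({
    # Read the novel and flatten it into individual characters
    text  <- readLines("war_and_peace.txt")      # path is a placeholder
    chars <- unlist(strsplit(tolower(paste(text, collapse = "")), ""))

    # Count only the letters
    counts <- table(chars[chars %in% letters])
  })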

Integration with Hadoop

The Consolidata Data Platform (and the Consolidata Data Lake) is powered by a Hadoop cluster, in which data of all file types is stored and managed efficiently, distributed across machines. Applications like R and SparkR sit on top of this cluster, aggregating data automatically as it lands.

Some will argue that dirty data isn’t valuable, no matter how much of it there is: if it can’t be read in, it can’t be processed. But you can use R alongside Scala, the language that powers big data processing engines like Apache Spark, to clean large volumes of data.


Find out more about the platform, or learn how to set up a Hadoop cluster with our CTO, Gordon Meyer.

Without getting too technical, you can use R to clean the data and pipe it to a Spark script that does the heavy processing, taking advantage of Spark’s ability to run programs up to 100x faster than Hadoop MapReduce in memory.
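
A sketch of that hand-off, with placeholder paths: clean the file in R, then write the result where a Spark job can pick it up.

  # Clean a raw file in R (paths are placeholders)
  df <- read.csv("/landing/raw.csv", stringsAsFactors = FALSE)
  df <- df[complete.cases(df), ]          # drop rows with missing values
  names(df) <- tolower(names(df))         # normalize column names
  write.csv(df, "/staging/clean.csv", row.names = FALSE)

  # A Spark job then reads /staging/clean.csv for the heavy processing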

R can read thousands of similarly structured files in a matter of seconds, with just a few lines of code:

  library(data.table)   # fread can read straight from a shell command's output
  files <- fread("hadoop fs -cat '/path/to/files/*.csv'", sep = ",", header = TRUE)

And by using SparkR, the same approach scales to large files, distributing the job across multiple machines:

# Launch SparkR with the CSV reader package:
# sparkR --packages com.databricks:spark-csv_2.10:1.2.0
df <- read.df(sqlContext, "/path/to/file/data.csv", "com.databricks.spark.csv", header = "true")

We can even ‘cheat’ and run SQL queries:

registerTempTable(df, "df")   # expose the DataFrame to Spark SQL
sample <- sql(sqlContext, "SELECT fConc, Class, fAlpha FROM df WHERE fAlpha > 1")
