Using Microsoft R Server at Scale

I have become something of a fanboy of Microsoft R Services in recent months.

However, one valid complaint is its lack of indexing, which somewhat puts its ability to scale to large datasets into question. Other open source tools handle this well: pandas in Python, for example, provides indexing across large data volumes and is all the more performant for it.

The purpose of this blog is to show you how one typically uses the R component of SQL Server when analysing big data files. If you are in BI and want to use R for more heavy-duty data processing, or if you are new to the R language, there should be something in this post for you.

 

The Problem

R is single-threaded, and one of its strengths as a tool is that it performs its work entirely in memory. With base functions, however, your machine can start to struggle on certain in-memory operations somewhere around 5-10 million rows, depending on what you are doing.
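
As a rough illustration of that ceiling, the sketch below (with made-up data) builds a ten-million-row data frame and times a base-R aggregation over it:

n <- 1e7
df <- data.frame(
  id    = sample(1e5, n, replace = TRUE),
  value = rnorm(n)
)
print(object.size(df), units = "MB")          # RAM occupied by the object
system.time(aggregate(value ~ id, df, mean))  # base aggregate() gets noticeably slow at this size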

 

library(data.table)

This library is often mentioned alongside the Hadleyverse (Tidyverse?), although it is not actually part of it. It handles data frames in a slightly different, more memory-efficient way – you typically build data tables rather than data frames. A data table is essentially a data frame (you can read more about this in the documentation) but avoids row names and allows indexing on named columns. This speeds up filtering, data wrangling and ad hoc analysis.

· Fast data importing – fread() reads roughly a million rows every 2-3 seconds, compared with anywhere between 10-15 seconds for read.table().

· Faster reshaping – the pivoting and unpivoting functions (dcast() and melt()) are more memory efficient (a short sketch follows below).
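
As a quick sketch of the reshaping functions, using a small made-up table (the column names are invented for illustration):

library(data.table)

dtLong <- data.table(
  personId = c("p1", "p1", "p2", "p2"),
  eventId  = c("333", "444", "333", "444"),
  best     = c(950, 3120, 1010, 2875)
)

# Pivot (long -> wide): one row per person, one column per event.
dtWide <- dcast(dtLong, personId ~ eventId, value.var = "best")

# Unpivot (wide -> long) back again.
dtBack <- melt(dtWide, id.vars = "personId",
               variable.name = "eventId", value.name = "best")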

When working with data tables in RStudio, you can check which objects have a key and which do not:

library(data.table)

# Read in data with fread(), then build a keyed data table with data.table().
wcaPersons <- fread("wcaCompetitors.tsv")
key(wcaPersons) # Returns NULL

wcaPersonsIx <- data.table(wcaPersons, key = "id")
key(wcaPersonsIx) # Returns "id".
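
To see what the key buys you, here is a minimal sketch (the id value is made up) comparing a keyed lookup with an ordinary filter; setkey() is also worth knowing, as it indexes an existing data table in place without copying it:

# Keyed lookup: a binary search on the sorted key column.
wcaPersonsIx[.("2009ABCD01")]

# Equivalent filter on the unkeyed table: a full vector scan.
wcaPersons[id == "2009ABCD01"]

# setkey() adds a key to an existing data table by reference, without copying.
setkey(wcaPersons, id)
key(wcaPersons) # Now returns "id".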

 

CREATE EXTERNAL RESOURCE POOL

You can configure the amount of memory available to the R runtime and adjust its CPU utilisation by creating an external resource pool with the following syntax:

CREATE EXTERNAL RESOURCE POOL rResourcePool
WITH
(
       MAX_CPU_PERCENT = 90,
       AFFINITY CPU = AUTO,
       MAX_MEMORY_PERCENT = 90
);

ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

These can be scripted out like other objects when you right-click on a particular database – they are found under External Resources.

 

library(RevoScaleR)

Perhaps the most scalable option at the moment, though, is the built-in functionality of Microsoft’s ScaleR (RevoScaleR) package. Rather than holding everything in memory, it stores data in external data frame (.xdf) files and reads from disk; the data is compressed and can be processed in blocks.
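
Before the SQL examples below, here is a minimal sketch of how such an .xdf file might be produced in the first place with rxImport(); the source file name wcaResults.tsv is assumed for illustration, and rowsPerRead controls the block size:

library(RevoScaleR)

# Convert a delimited text file into a compressed, block-structured .xdf file,
# importing it in 500,000-row chunks rather than all at once.
# (wcaResults.tsv is an assumed source file name.)
rxImport(inData = "C:/OF/wcaResults.tsv",
         outFile = "C:/OF/wcaResults.xdf",
         rowsPerRead = 500000,
         overwrite = TRUE)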

DECLARE @RScript nvarchar(512);

SET @RScript = N'
library(RevoScaleR)
library(RJSONIO)

## Read in an Xdf into memory if you prefer.
wcaResultsXdf <- rxReadXdf("C:/OF/wcaResults.xdf")

## Get a summary.
wcaSummary <- rxSummary( ~. , wcaResultsXdf)

## Output as JSON.
wcaSummaryJson <- data.frame(toJSON(wcaSummary))
';

EXEC sp_execute_external_script
       @language = N'R',
       @script = @RScript,
       @output_data_1_name = N'wcaSummaryJson'
WITH RESULT SETS ((
       wcaSummaryJson varchar(max)
       ));

rxGetInfo() can be used to extract metadata about a particular .xdf file or data frame, including the number of rows, the number of blocks and variable information. Other rx…() functions use this pre-calculated metadata to speed up their own operations.

rxSummary() provides descriptive statistics for any columns you pass to it via a formula. It produces basic counts for string factors (categorical data) in each column, and for numeric columns a summary table of max, min, mean, standard deviation and missing-value counts. These two functions can be combined into a very powerful data profiling tool (a short sketch follows below)!
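
As a rough sketch of that profiling idea, pointing both functions at the same wcaResults.xdf file used above:

library(RevoScaleR)

# Reference the .xdf file on disk without loading it.
wcaResultsXdf <- RxXdfData("C:/OF/wcaResults.xdf")

# Metadata only: rows, blocks and variable types.
rxGetInfo(wcaResultsXdf, getVarInfo = TRUE)

# Descriptive statistics for every column (~. means "all variables").
rxSummary(~., data = wcaResultsXdf)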

 

Streaming in data from SQL tables

If you want to use a SQL query as input, you can declare @r_rowsPerRead in the @params collection so that the data is streamed in batches. This is particularly useful when you want to feed millions or billions of rows into a machine learning model or some other predictive analysis.

DECLARE @RScript nvarchar(512);

SET @RScript = N'
library(RevoScaleR)

# Perform some sort of predictive modelling.
someOutput <- rxGlm(best ~ eventId + roundId + personCountryId, data = wcaInput)

# Return some sort of output table (I am not a statistician, can you tell?)
someOutput <- data.frame(
  someOutput$coefficients,
  someOutput$aliased
)
';

EXEC sp_execute_external_script
       @language = N'R',
       @script = @RScript,
       @input_data_1 = N'SELECT * FROM [feed].[wcaResults]',
       -- where [feed].[wcaResults] has 5m+ rows.
       @input_data_1_name = N'wcaInput',
       @output_data_1_name = N'someOutput',
       @params = N'@r_rowsPerRead int',
       @r_rowsPerRead = 500000;

 

On the Consolidata Data Platform, Microsoft R Server is becoming an increasingly useful and powerful tool for managing medium-to-large data files. The RevoScaleR package’s ability to infer field and row terminators, R’s ability to clean and reshape data files and SQL Server’s performance together make for robust ETL. This means we can read in pretty much any data file and profile it almost instantly.

Right now, I see Microsoft R Server’s greatest potential as being for data scientists: R is an easy language to learn, and the barrier to entry for someone with SQL knowledge is low. How it will fare with more serious data management business cases and big data problems remains to be seen!
