Data Science Festival: https://www.meetup.com/Data-Science-Festival-London/events/237401047/
GitHub containing the slides and code: https://github.com/olfrost/DataScienceWithRAndSqlServer
For those reading who are technical, the advantages of using relational databases should be fairly obvious by now. When dealing with structured data, the storage requires less space and more secure, and changes to the data are logged.
You are also gifted with a complex querying language. Not complex in the sense that it is difficult to learn or read (quite the opposite in fact) but in that you are not limited by the types of questions that you can ask. Commands are written with English-like phrases and are compiled into database code that the database engine uses later.
What can R do that SQL can’t?
R is a high-level and functional language developed by and for statisticians. There are over 6,000 libraries available to use for free on CRAN, so the chances are that if you have a business problem you want to try and solve with R then there is already a package out there – there’s no sense in duplicating work.
These include packages for data visualisation, text analysis, machine learning, complex statistical modelling, summarising and profiling data, performing audio and visual analysis, cleaning and reshaping data files, reading Word and Excel files, connecting to other database technologies and querying them, collecting social media data, connecting to rest APIs and data enrichment.
In short, R can handle semi-structured and non-structured data better than SQL can. It is also a high-level language that is easy to learn and explore data with.
How I use these tools together?
- Structured, modelled data lives in SQL, and the updating, insertion and deletion of data is done in SQL.
- Any changes to the data are logged, so for the modelling of relationships in data then SQL is a much more robust tool.
- R provides the analytics and a means of quickly finding answers to questions about data. Data is easy and quick to import, perform business analysis pieces on, or reshape and clean up.
- R provides an excellent analytics layer to your data warehouse solution. Find answers quickly and drive the next step of the business intelligence process.
R Scripts in SQL Server 2016
We have explored how to do this in previous blogs. To find out how to install R inside SQL Server 2016 (or a standalone R Server), then click here. To simplify, the execution of R scripts occurs in three basic steps:
- Declare an nvarchar(max) variable.
- Set the value said variable with our R script.
- Run sp_execute_external_script.
DECLARE @RScript NVARCHAR(max);
SET @RScript = N' output <- tolower(input)';
@language = N'R',
@script = @Rscript, -- input the R script.
@params = N'@r_rowsPerRead int = 500000',
-- specify how many rows to read in at a time from the input query.
@parallel = 1,
-- specify if you want the job to run in parallel, 1 or 0.
@input_data_1 = N'SELECT * FROM table',
-- feed the proc an input query from another SQL table.
@input_data_1_name = N'input',
@output_data_1_name = N'output'
WITH result sets ((
idcolumn int, charcolumn nvarchar(70)
-- define your output data types.
For data scientists, R is a powerful tool. Open source R, however, is not without its limitations. Microsoft R Server handles large data files and provides a feast of functionality for running data science jobs and processing in parallel. The following code should give you some insight into some of the key features of Microsoft R Server and the tools you have available to you to speed up big data analysis in R.
Microsoft R Server provides a faster alternative for reading data. External data frames (XDFs) are compressed data frames that sit on the disk, and once converted from a flat file into an .XDF file then performance benefits can be seen pretty much immediately.
#### Author: Ollie Frost
#### Description: Performance comparisons of base, data.table and RevoScaleR.
#### Timings on an instance of SQL Server 2016 Developer Edition with 4GB RAM.
#### 45 columns and 400,000+ rows.
# Reading in with data.table.
df <- fread("file.csv") # 5 seconds
# Reading in with base.
df <- read.table("file.csv", header = TRUE, sep = ",") # 15-16 seconds
# Read in with RxTextData.
df <- RxTextData("file.csv") # 3 seconds
# Converting to XDF.
df <- rxDataStep(inData = "file.csv", outFile = "file.xdf") # takes a while initially, but long term gains are key here.
# Read the data frame from the XDF file.
df <- rxXdfToDataFrame("file.xdf") # 2-3 seconds.
# Final summary object.
dfSummary <- rxSummary(~., "file.xdf") # 2-3 seconds.
XDFs can be read in in blocks of rows. For example, a text file with 10 million rows can be read in in blocks of 500k and any further processing or analysis on that files will take place in parallel across the blocks. Run ?RxTextData in an R console to get help on the full functionality.
Once your files are loaded, you might want to explore these functions yourself, depending on what you want to do:
- rxHadoop family – copy files to and from a Hadoop cluster, run jobs, run Hadoop commands, check the status of nodes, list files.
- rxPredict – plus a whole range of statistics methods available to you like functions for Random Forests and Chi Squared. Scalable data science functions that process across multiple nodes.
- rxSpark – create a Spark context.
- RxSpssData and RxSasData – analyse data from other data analysis tools.
R and SQL are very compatible.
In fact, R and a lot of other amazing tools are very compatible. However, as a business intelligence analyst or developer who works with databases, R and SQL together are a supreme combination.
While R in Microsoft R Server is still evolving, people are realising the power of R and SQL’s partnership, which is gradually becoming more and more adept for big data and high performing enterprise solutions.