This blog will show you how to use Microsoft R to dynamically extract metadata and manipulate it using the Microsoft R RevoScaleR library.
If you have attended any of my presentations around Microsoft R, you may remember I have recommended that developers install a standalone R Server instance using the SQL Server 2016 Developer Edition installer.
There is actually no need to do this – if you want to try out the proprietary R Server functions for yourself for free, install Microsoft R Client following the instructions here. If you value your privacy, you may consider running this piece of code after you install, too:
Run ?rxPrivacyControl to find out more about the data collection Microsoft performs.
As a data engineer, I regularly wrangle with survey files with data from real-life respondents to market research surveys or questionnaires. A common tool for the data collection and ad hoc analysis of this data is SPSS, a licensed tool developed by IBM. Traditionally, these files are stored as .SAV files which require special software to read and interpret.
If you want to analyse and model this kind of data properly, it’s a good idea to model the responses as well as the metadata that comes with these files. To do this in the past I have used R, painfully deploying a combination of techniques and reverse-engineered solutions from much older and (sometimes) temperamental packages like foreign, memisc and Rz. It doesn’t always work and is quite a fiddly job to get encoding and file writing just right.
This blog will show you how to use Microsoft R to dynamically extract this metadata and manipulate the data into a useable format, using the new RxSpssData function.
Here’s the script, with some explanation on what’s going on:
- The developer specifies a path containing the survey files they want processing. The file names with the .sav suffix are extracted from this folder and the developer picks his/her favorite.
- If it doesn’t exist, an XDF sub-directory is created and the name of the XDF is decided.
- RxSpssData and rxDataStep are called to stream the data from the .sav file directly into an .xdf file. The input parameters ensure that the metadata is made available as attributes – see here for more.
- One approach is to pull the whole data set and get the information that way from the attributes in the data frame columns. A better and more efficient solution is to call rxGetVarInfo()
- Extract the data with a number of base functions and wrangle into two data frames.
- rxMerge the two data frames together. NB: Has to be two XDFs or two data frames, not one of each, which is a shame.
Example output: https://github.com/olfrost/DataScienceWithRAndSqlServer/blob/master/dfFinal.csv
Missing is missing
Unfortunately, coding for missing values is a bit trickier. This input parameter inside RxSpssData:
mapMissingCodes = "none"
Tells RevoScaleR to include all of the answer labels and answer values that are marked by the surveyor as “missing”. The default behaviour is to convert these values to NA (or use the first missing value it finds – see this.) Currently, getting this data requires creating different XDFs using different combinations of input parameters on RxSpssData and then merging the result. Long-winded and inefficient.
But this is certainly a massive improvement on the older methods of reading and analysing SPSS files in R, especially with the parallel processing and multithreading benefits of Microsoft R Server. A bit unfinished, but still a very useful set of functions.