I decided to put together a series of blog posts on how I created the Shiny web applications on the Consolidata Data Science Portal.
This is part 1: a simple step-by-step guide to getting R up and running on your machine so that you can run some R code from the R console. The data used is from Wikipedia and is for illustration only. At the bottom I have listed some great sites to learn more R.
When using R the first thing to do is install the R application.
To do this, browse to https://cran.r-project.org/ and download R.
Choose the version compatible with your operating system, then download and install it.
To test whether R is installed correctly, you can run some commands against R from the command line. To do this, open the command prompt in Windows and browse to the install location, below that location is
From here you can run the R application by typing
To check that R is actually running you can pass a command to the console, for example if you type
If R is running correctly, your browser should open with the R help documentation.
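A minimal sketch of this check, assuming the command in question is help.start() (it matches the behaviour described, but that pairing is an assumption):

```r
# Assumption: the command described above is help.start(), which opens the
# HTML help index in the default browser.
# Guarded so that a browser is only launched from an interactive session.
if (interactive()) {
  help.start()
}
```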
This is an extremely useful function to which you can pass topics or packages. For example
This will open the help documentation on plot, a function in R for plotting objects.
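As a sketch of passing a topic to the help system, assuming the topic is plot (inferred from the description above):

```r
# Open the help page for the plot function (assumed topic).
help(plot)

# The ? operator is shorthand for the same lookup.
?plot
```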
Another useful command when starting out is example. This allows you to run the examples for a topic in the R help library, if any exist. For example
This will run some R plot examples, so you can easily see what the function does. The window which opens allows you to select Next>Next Plot which will show another example if it exists.
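A minimal sketch, again assuming the topic is plot:

```r
# Run the examples from the plot help page (assumed topic).
# In an interactive session R may prompt before drawing each new plot.
example(plot)
```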
Show built in functions available in R.
Allows user to view or set options on how R works.
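As a sketch of the two descriptions above, assuming the commands meant are builtins() and options() (both exist in base R, but matching them to this text is an assumption):

```r
# List the first few names of R's built-in objects (assumed command).
head(builtins(), 10)

# View the current value of a single option.
getOption("digits")

# Set an option. options() returns the previous values so they can be restored.
old <- options(digits = 4)
print(pi)
options(old)   # restore the earlier setting
```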
Vectors and Dataframes
In R one of the most useful objects to store data is a dataframe. Dataframes allow you to store data in a data table. We can simply pass data into a dataframe and then manipulate or retrieve that data.
To add data to the data frame I am going to create two vectors. A vector is a basic data structure with a single data type. To create the vectors I am using the following code. Note the use of <-, which is how we assign values in R.
Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)
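To illustrate what "a single data type" means in practice, here is a small sketch. The vector named mixed is a made-up example and is not part of the data set:

```r
# A vector holds one data type, so mixed inputs are coerced to a common type.
mixed <- c(1, "two", 3)   # hypothetical example vector
class(mixed)              # "character"

# <- assigns a value to a name; elements are indexed starting from 1.
Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Countries[1]              # "China"
length(Countries)         # 5
```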
Now I have two vectors, one called Countries and another called Population. I can check the structure of an R object using str
This shows me I have a character vector (chr) called Countries and a numeric vector (num) called Population. Both vectors are populated from index 1 to 5.
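A minimal sketch of the str calls on the two vectors defined earlier:

```r
Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

# str() reports the type, the index range and the first few values.
str(Countries)    # a chr vector with indices 1:5
str(Population)   # a num vector with indices 1:5
```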
I want to create a dataframe called Countries.Statistics, using the data.frame function, to which I can pass my vectors.
Countries.Statistics <- data.frame(Countries,Population)
Again I can run str to check the object in R
I can also use the summary function to get a summary of the dataframe. This summary shows the number of occurrences of each country in the Countries column and the min, max, mean and median of the Population column. It also shows the 1st and 3rd quartiles.
This is only mildly useful on a small dataframe like this example, but very useful when we move to datasets containing large amounts of data.
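A sketch of the str and summary calls on the dataframe. One caveat: since R 4.0.0, data.frame keeps strings as character columns by default, so the per-country counts described above only appear if the column is a factor. The stringsAsFactors = TRUE argument below is my addition to reproduce that behaviour on newer versions of R:

```r
Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

# stringsAsFactors = TRUE stores Countries as a factor, which is what makes
# summary() count the occurrences of each country.
Countries.Statistics <- data.frame(Countries, Population,
                                   stringsAsFactors = TRUE)

str(Countries.Statistics)       # 5 obs. of 2 variables
summary(Countries.Statistics)   # counts per country, quartiles for Population
```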
To print out the dataframe in a table format I can simply type the name
To list out the values in a column I can use the $ symbol
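A minimal sketch of both steps, rebuilding the dataframe so the snippet runs on its own:

```r
Countries.Statistics <- data.frame(
  Countries  = c("China", "India", "United States", "Indonesia", "Brazil"),
  Population = c(1364580000, 1244390000, 318081000, 247424598, 202593000)
)

# Typing the name prints the whole dataframe as a table.
Countries.Statistics

# The $ symbol extracts a single column as a vector.
Countries.Statistics$Population

# The result is an ordinary vector, so it can be indexed.
Countries.Statistics$Population[1]
```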
I can add a new column for CO2 to the dataframe
Countries.Statistics$CO2 <- c(10330000,2070000,5300000,510000,480000)
I can rename the column CO2 using the code below
names(Countries.Statistics)[names(Countries.Statistics)=="CO2"] <- "CO2emissionsKT"
I can then drop the new column if I want using
Countries.Statistics$CO2emissionsKT <- NULL
For now I will keep this column, so I will not run the line above.
I can also select a subset of columns or filter the rows. I can return two of the three columns by running the following code
subset(Countries.Statistics, select = c(Countries, CO2emissionsKT))
I can select only the country Brazil using
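A sketch of one way to do this with subset, assuming the filter is a simple equality test on the Countries column (bracket indexing is shown as an equivalent alternative):

```r
Countries.Statistics <- data.frame(
  Countries      = c("China", "India", "United States", "Indonesia", "Brazil"),
  Population     = c(1364580000, 1244390000, 318081000, 247424598, 202593000),
  CO2emissionsKT = c(10330000, 2070000, 5300000, 510000, 480000)
)

# Keep only the rows where the Countries column equals "Brazil".
brazil <- subset(Countries.Statistics, Countries == "Brazil")
brazil

# The same filter written with bracket indexing.
Countries.Statistics[Countries.Statistics$Countries == "Brazil", ]
```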
Math and Statistical functions
Now that I have a dataframe containing my data, I can use some of the maths and statistical functions to analyse it. For example, I can apply mathematical calculations or calculate the mean, median, min and max. I am going to list the commands without explanation, as most of them should be self-explanatory.
10 + 10 # Add 2 numbers
Countries.Statistics$Population * 10
Countries.Statistics$Population / 100000
Countries.Statistics$Population > Countries.Statistics$CO2emissionsKT
min(Countries.Statistics$Population)
max(Countries.Statistics$Population)
mean(Countries.Statistics$Population)
median(Countries.Statistics$Population)
sd(Countries.Statistics$Population)
sum(Countries.Statistics$Population)
(Countries.Statistics$Population / sum(Countries.Statistics$Population)) * 100
Countries.Statistics$PercentageOfTop5 <- (Countries.Statistics$Population / sum(Countries.Statistics$Population)) * 100
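The last calculation is the least obvious one, so here is a small sketch of what it does: it works out each country's share of the combined population of the five countries listed. The variable name total is mine, for illustration only:

```r
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

# Combined population of the five countries.
total <- sum(Population)

# Arithmetic on a vector is applied element by element,
# giving one percentage per country.
PercentageOfTop5 <- (Population / total) * 100
round(PercentageOfTop5, 1)

# The shares add up to 100 (within floating point error).
sum(PercentageOfTop5)
```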
One of the most useful features of R is plotting graphs. R has a lot of graph options built in and also a wealth of graphs available in user libraries. For now I will look at some of the built in graph functions.
The simplest graph to produce is the scatterplot using plot(). Everything in the dataframe will be plotted, as we have not specified the structure.
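A minimal sketch, assuming the call is simply plot() with the dataframe as its only argument (the dataframe is rebuilt here so the snippet runs on its own):

```r
Countries.Statistics <- data.frame(
  Countries      = c("China", "India", "United States", "Indonesia", "Brazil"),
  Population     = c(1364580000, 1244390000, 318081000, 247424598, 202593000),
  CO2emissionsKT = c(10330000, 2070000, 5300000, 510000, 480000)
)

# Given a whole dataframe, plot() draws a scatterplot for every pair of columns.
plot(Countries.Statistics)
```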
Alternatively I can plot the data with a little more structure
plot(Countries.Statistics$Population, Countries.Statistics$CO2emissionsKT, xlab = "Population", ylab = "Emissions (KT)")
We can then fix up the scientific axis labels using the scipen option. Setting this to a higher number will make the axis less likely to have scientific notation like above. I have also added labels to the points to show which country each point represents and a title using the main parameter. A regression line is added using abline.
options("scipen" = 10)
plot(Countries.Statistics$Population, Countries.Statistics$CO2emissionsKT, xlab = "Population", ylab = "Emissions (KT)", main = "Population and Emissions")
text(Countries.Statistics$Population, Countries.Statistics$CO2emissionsKT, labels=Countries.Statistics$Countries, cex= 0.7, pos=3)
abline(lm(Countries.Statistics$CO2emissionsKT~Countries.Statistics$Population), col="green")
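To see what scipen does outside of a plot, here is a small sketch using format() on a single number. The value 1e10 is an arbitrary example, not part of the data set:

```r
# scipen is a penalty applied when R chooses between fixed and scientific
# notation. Higher values make fixed notation more likely.
old <- options(scipen = 0)   # R's default
sci <- format(1e10)          # "1e+10"

options(scipen = 10)
fixed <- format(1e10)        # "10000000000"

options(old)                 # restore the previous setting
```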
Another useful option for this set is barplot.
barplot(Countries.Statistics$Population, main="Population", horiz=FALSE, names.arg=Countries.Statistics$Countries)
There are lots of other examples of graphs in the default R libraries. Many can be found on the links below.
Part two will look at installing RStudio and using it to create some Shiny web applications. This will allow us to create more advanced graphical outputs and publish to a server.