R Part 1: Installing and Using R

I decided to put together a series of blog posts on how I created the Shiny web applications on the Consolidata Data Science Portal.

This is part 1: a simple step-by-step guide to installing R and running some R code from the R console. The data used is from Wikipedia and is for illustration only. At the bottom I have listed some great sites to learn more R.

Installing R

When using R, the first thing to do is install the R application.

To do this, download R from https://cran.r-project.org/

Choose the version compatible with your operating system, then download and install it.

To test that R is installed correctly, you can run some commands against R from the command line. On Windows, open the command prompt and browse to the install location. In my case that location is

c:\Program Files\R\R-3.2.0\bin\

From here you can run the R application by typing

r


To check that R is actually running, you can pass a command to the console. For example, type

help()

If R is running correctly, the browser should open with the R help documentation.
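Another quick check, as a minimal sketch, is to ask the running session which version of R it is and where it is installed. R.version.string, R.home() and R.version are all part of base R; the values printed will depend on your own install.

```r
# Report which R build this session is running
print(R.version.string)

# Directory of the R installation that started this session
print(R.home())

# The individual version components are available as well
print(paste(R.version$major, R.version$minor, sep = "."))
```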

help is an extremely useful function to which you can pass topics or packages. For example

help(plot)

This will open the help documentation for plot, R's generic function for plotting objects.
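As a rough sketch of the other ways into the help system, the code below looks up a topic by name and asks for the index of a whole package. I have used the graphics package purely as an example. The results are assigned to variables (the names are my own) so that nothing opens until you print them.

```r
# Look up a topic by name; this is the same lookup as help(plot) or ?plot
topic_help <- help("plot")

# Ask for the index of a whole package instead of a single topic
package_help <- help(package = "graphics")

# Both calls return objects; the help only appears when they are printed
class(topic_help)
class(package_help)

# At the console, typing the variable name displays the page, e.g.
# topic_help
```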

Another useful command in the beginning is example. This allows you to run examples on a topic in the R help library, if one exists. For example

example(plot)

This will run some R plot examples, so you can easily see what the function does. The window which opens allows you to select Next>Next Plot which will show another example if it exists.


builtins()

Lists the names of all the objects built into R, including its built-in functions.

options()

Lets you view or set the options that control how R works.
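As a small sketch of how options can be read, changed and restored, the code below uses the digits option, which controls how many significant digits R prints. The variable names are my own; getOption() and options() are base R.

```r
# Peek at the first few built-in object names
head(builtins())

# Read a single option
original_digits <- getOption("digits")
print(pi)

# options() returns the previous values, so they can be restored afterwards
old <- options(digits = 3)
print(pi)   # now printed with 3 significant digits

# Put the original setting back
options(old)
restored_digits <- getOption("digits")
```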

 

Vectors and Dataframes

In R one of the most useful objects to store data is a dataframe. Dataframes allow you to store data in a data table. We can simply pass data into a dataframe and then manipulate or retrieve that data.

To add data to the dataframe I am going to create two vectors. A vector is a basic data structure with a single data type. To create the vectors I am using the following code. Note the use of <-, which is how we assign values in R.

Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

Now I have two vectors, one called Countries and another called Population. I can check the structure of an R object using str

str(Countries)
str(Population)


This shows me that Countries is a character vector (chr) and Population is a numeric vector (num). Both vectors are populated from index 1 to 5.
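To make the "single data type" and "index 1 to 5" points concrete, here is a minimal sketch using the same two vectors. The extra variable names (first_country, last_population, mixed) are only for illustration.

```r
Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

# Indexing starts at 1, not 0
first_country <- Countries[1]       # "China"
last_population <- Population[5]    # 202593000

# A vector holds one type only: mixing numbers and text turns everything into text
mixed <- c(1, "two", 3)
class(mixed)                        # "character"

# length() reports how many elements a vector has
length(Countries)                   # 5
```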

I want to create a dataframe called Countries.Statistics using the data.frame function, to which I can pass my vectors.

Countries.Statistics <- data.frame(Countries,Population)

Again I can run str to check the object in R

str(Countries.Statistics)

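I can also use the summary function to get a summary of the dataframe. This summary shows the number of occurrences of each country in the Countries column and the min, max, mean and median of the Population column. It also shows the 1st and 3rd quartiles. Note that the per-country counts appear only when Countries is stored as a factor, which was the data.frame default before R 4.0.0. From R 4.0.0 onwards a character column is summarised by its length, class and mode instead.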

summary(Countries.Statistics)

This is only mildly useful on a small dataframe like this example, but very useful when we move to datasets containing large amounts of data.
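If you are on a newer version of R and want the per-country counts in the summary, one option is to ask data.frame for factors explicitly. This is a sketch rather than the only way to do it; stringsAsFactors is a standard data.frame argument.

```r
Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

# stringsAsFactors = TRUE stores Countries as a factor, which is what
# data.frame() did by default before R 4.0.0
Countries.Statistics <- data.frame(Countries, Population, stringsAsFactors = TRUE)

str(Countries.Statistics)       # Countries is listed as a Factor with 5 levels
summary(Countries.Statistics)   # one count per country, plus quartiles for Population

# Basic shape of the dataframe
nrow(Countries.Statistics)      # 5
ncol(Countries.Statistics)      # 2
```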

To print out the dataframe in a table format I can simply type the name

Countries.Statistics

To list out the values in a column I can use the $ symbol

Countries.Statistics$Population
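The $ symbol is one of several equivalent ways to reach a column. The sketch below rebuilds the dataframe so it runs on its own and shows the same column fetched by name and by position, plus a single cell. The variable names on the left are my own.

```r
Countries.Statistics <- data.frame(
  Countries = c("China", "India", "United States", "Indonesia", "Brazil"),
  Population = c(1364580000, 1244390000, 318081000, 247424598, 202593000)
)

# Three ways of getting the same column back as a vector
by_dollar <- Countries.Statistics$Population
by_name <- Countries.Statistics[["Population"]]
by_position <- Countries.Statistics[, 2]

# A single value: row 3, column "Population"
us_population <- Countries.Statistics[3, "Population"]   # 318081000
```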

I can add a new column for CO2 to the dataframe

Countries.Statistics$CO2 <- c(10330000,2070000,5300000,510000,480000)

I can rename the column CO2 using the code below

names(Countries.Statistics)[names(Countries.Statistics)=="CO2"] <- "CO2emissionsKT"
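The rename line packs a few steps into one, so here is a sketch that splits it up on a small two-column dataframe. The intermediate variable is_co2 is my own addition for illustration.

```r
Countries.Statistics <- data.frame(
  Countries = c("China", "India", "United States", "Indonesia", "Brazil"),
  CO2 = c(10330000, 2070000, 5300000, 510000, 480000)
)

# names() returns the column names as a character vector
names(Countries.Statistics)

# The comparison gives a logical vector: TRUE only where the name matches
is_co2 <- names(Countries.Statistics) == "CO2"
is_co2

# Assigning through that logical index changes only the matching name
names(Countries.Statistics)[is_co2] <- "CO2emissionsKT"
names(Countries.Statistics)
```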

I can then drop the new column if I want using

Countries.Statistics$CO2emissionsKT <- NULL

For now I will keep this column.

I can also select a subset of columns or filter the rows. I can return two of the three columns by running the following code

subset(Countries.Statistics, select = c(Countries, CO2emissionsKT))

I can select only the country Brazil using

subset(Countries.Statistics, Countries=="Brazil")
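The same selections can also be written with square brackets, and subset can filter rows and pick columns in a single call. This is a sketch that rebuilds the dataframe so it runs on its own; the one-billion threshold and the result names are arbitrary choices of mine.

```r
Countries.Statistics <- data.frame(
  Countries = c("China", "India", "United States", "Indonesia", "Brazil"),
  Population = c(1364580000, 1244390000, 318081000, 247424598, 202593000),
  CO2emissionsKT = c(10330000, 2070000, 5300000, 510000, 480000)
)

# Row filter with brackets: condition before the comma, all columns after it
brazil <- Countries.Statistics[Countries.Statistics$Countries == "Brazil", ]

# Column selection with brackets: all rows, two named columns
two_columns <- Countries.Statistics[, c("Countries", "CO2emissionsKT")]

# subset() combining a row filter and a column selection
large <- subset(Countries.Statistics, Population > 1e9,
                select = c(Countries, Population))
large
```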

 

Math and Statistical functions

Now that I have a dataframe containing my data, I can use some of the maths and statistical functions to analyse it. For example, I can apply mathematical calculations or calculate the mean, median, min and max. I am going to list the commands with only a brief comment each, as most should be self-explanatory.

10 + 10 # Add 2 numbers
Countries.Statistics$Population * 10 # Multiply each population by 10
Countries.Statistics$Population / 100000 # Divide each population by 100,000
Countries.Statistics$Population > Countries.Statistics$CO2emissionsKT # Compare the two columns row by row
min(Countries.Statistics$Population) # Smallest population
max(Countries.Statistics$Population) # Largest population
mean(Countries.Statistics$Population) # Mean population
median(Countries.Statistics$Population) # Median population
sd(Countries.Statistics$Population) # Standard deviation of the populations
sum(Countries.Statistics$Population) # Total population of the five countries
(Countries.Statistics$Population / sum(Countries.Statistics$Population)) * 100 # Each country's percentage of the total
Countries.Statistics$PercentageOfTop5 <- (Countries.Statistics$Population / sum(Countries.Statistics$Population)) * 100 # Store that percentage as a new column
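What makes these one-liners work is that R applies arithmetic to every element of a vector at once. The sketch below shows that on the Population vector alone and checks that the percentages add up to 100. The names population_millions and share are my own.

```r
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

# Arithmetic is applied element by element, so no loop is needed
population_millions <- Population / 1000000
population_millions

# Each country's share of the five-country total, rounded to 1 decimal place
share <- Population / sum(Population) * 100
round(share, 1)

# The shares add up to 100, allowing for floating-point rounding
all.equal(sum(share), 100)

# which.max() gives the position of the largest value
which.max(Population)   # 1, the first element (China)
```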

 

Graphs

One of the most useful features of R is plotting graphs. R has a lot of graph options built in and also a wealth of graphs available in user libraries. For now I will look at some of the built in graph functions.

The simplest graph to produce is the scatterplot, using plot(). As we have not specified which columns to plot, every column in the dataframe will be plotted against every other column.

plot(Countries.Statistics)

Alternatively I can plot the data with a little more structure

plot(Countries.Statistics$Population, Countries.Statistics$CO2emissionsKT, xlab = "Population", ylab = "Emissions(KT)")

We can then fix up the scientific axis labels using the scipen option. Setting this to a higher number will make the axis less likely to have scientific notation like above.  I have also added labels to the points to show which country each point represents and a title using the main parameter. A regression line is added using abline.

options("scipen" = 10)
plot(Countries.Statistics$Population, Countries.Statistics$CO2emissionsKT, xlab = "Population", ylab = "Emissions(KT)", main = "Population and Emissions")
text(Countries.Statistics$Population, Countries.Statistics$CO2emissionsKT, labels=Countries.Statistics$Countries, cex= 0.7, pos=3)
abline(lm(Countries.Statistics$CO2emissionsKT~Countries.Statistics$Population), col="green")
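The lm call inside abline can also be run on its own if you want to look at the fitted line rather than just draw it. This is a sketch using the data argument of lm so the formula can use bare column names. The variable fit is my own name, and with only five points the numbers are for illustration only.

```r
Countries.Statistics <- data.frame(
  Population = c(1364580000, 1244390000, 318081000, 247424598, 202593000),
  CO2emissionsKT = c(10330000, 2070000, 5300000, 510000, 480000)
)

# Fit emissions as a straight-line function of population
fit <- lm(CO2emissionsKT ~ Population, data = Countries.Statistics)

# Intercept and slope of the fitted line
coef(fit)

# After a plot() call, abline(fit, col = "green") would draw this fitted line
```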

Another useful option for this dataset is barplot.

barplot(Countries.Statistics$Population, main="Population", horiz=FALSE,  names.arg=Countries.Statistics$Countries)
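As a sketch of how the bar chart could be tidied up, the code below plots population in millions, orders the bars from smallest to largest and writes the result to a PDF. The temporary file and the names ord and out_file are my own choices; writing to a file simply means the example also runs where no plot window is available.

```r
Countries <- c("China", "India", "United States", "Indonesia", "Brazil")
Population <- c(1364580000, 1244390000, 318081000, 247424598, 202593000)

# Express population in millions so the axis needs no scientific notation
population_millions <- Population / 1e6

# order() gives the positions that sort the values from smallest to largest
ord <- order(population_millions)

# Draw into a temporary PDF file instead of a plot window
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
barplot(population_millions[ord], names.arg = Countries[ord],
        main = "Population", ylab = "Population (millions)", cex.names = 0.8)
invisible(dev.off())

out_file   # path of the PDF that was written
```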

There are lots of other examples of graphs in the default R libraries. Many can be found on the links below.

Part two will look at installing RStudio and using it to create some Shiny web applications. This will allow us to create more advanced graphical outputs and publish to a server.

 

Useful R resources
