This blog post outlines two easy ways to plot multiple time series: one with the series on separate plots, and one with the series on the same plot. It then considers whether there is a relationship between the two example variables.
Since it’s St Patrick’s Day as I write this, let’s look at these two variables:
Google searches for the word ‘Guinness’, and Google searches for the phrase ‘St Patricks Day’ (apostrophe not important).
Question: When the search term ‘St Patrick’s Day’ spikes in popularity, does the popularity of the search term ‘Guinness’ also spike?
We can begin to answer this question by plotting a time-series of these two variables.
If you’re interested, this data can be found here and can be downloaded as a CSV.
Here is a snapshot of some of the data:
The data was read into R using the read.csv() function and was then formatted as a time series using the ts() function as follows:
ts <- read.csv("C:/Users/Consultant/Desktop/R Data/ts_stpatricks.csv")
# subset the columns we want to analyse
ts_searches <- ts[,2:3]
# transform into time series format
relative_search_interest <- ts(data = ts_searches, start = 2004, freq = 52, names = c("St Patricks Day", "Guinness"))
The key thing here is the names argument of ts(). It allows us to name multiple time series, which can then be plotted together.
The data is weekly, so there are 52 observations per year. The series starts in 2004 and ends on St. Patrick’s Day 2016 (the day this data was harvested).
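If you want to check what ts() has built before plotting anything, a few base R helpers will tell you. This is only a sketch: the data frame below is made-up random data standing in for the real CSV, and the names searches and demo_ts are hypothetical.

```r
# Made-up stand-in for the two search-interest columns (NOT the real data).
set.seed(1)
n_weeks <- 52 * 3
searches <- data.frame(
  st_patricks = runif(n_weeks, 0, 100),
  guinness    = runif(n_weeks, 0, 100)
)

# One multivariate series: 52 observations per year, starting in 2004.
demo_ts <- ts(data = searches, start = 2004, frequency = 52,
              names = c("St Patricks Day", "Guinness"))

colnames(demo_ts)   # the labels supplied through the names argument
frequency(demo_ts)  # observations per year
start(demo_ts)      # first year and week of the series
end(demo_ts)        # last year and week of the series
```

The labels passed through names are what appear on the panels and axes when the series is plotted.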
Now for the plotting. Think about what your intuition tells you about the relationship when you look at these different plots.
We can use a very simple plot.ts() function to generate two graphs in the same plot-space:
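As a rough sketch, with made-up seasonal data standing in for the real series (the names demo and demo_ts and the shape of the yearly bump are my assumptions, not the actual search data), the call looks something like this:

```r
# Synthetic weekly data standing in for the real search-interest series.
set.seed(42)
weeks <- 1:(52 * 4)
spike <- exp(-((weeks %% 52) - 11)^2 / 8)  # yearly bump around week 11 (mid-March)
demo <- cbind(
  100 * spike + runif(length(weeks), 0, 2),   # sharp yearly spike
  10 + 5 * spike + rnorm(length(weeks))       # noisier series with a smaller bump
)
demo_ts <- ts(demo, start = 2004, frequency = 52,
              names = c("St Patricks Day", "Guinness"))

# One panel per series, stacked on a shared time axis.
plot.ts(demo_ts, main = "Relative Search Interest")
```

Each series gets its own y-axis scale here, which is worth keeping in mind when comparing the heights of the peaks by eye.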
And if we want to plot them on the same graph, we can use the ts.plot() function. Yes, it’s literally the plot.ts() function swapped around.
ts.plot(relative_search_interest, gpars = list(col = c("red", "blue"), ylab = "Relative Search Interest"))
Now at present these are not publishable graphics, but using these functions is a good place to start when you’re looking to plot multiple time series.
But is there really a relationship between the popularity of these two search terms?
What’s interesting is that when the plots are overlaid, the relationship looks much stronger than when the series are presented separately. This is because some of the peaks in searches for ‘Guinness’ seem to correspond with the obvious seasonality in searches for ‘St Patricks Day’.
But let’s not let these lines deceive us. Let’s run a linear regression to test the strength of the relationship:
reg1 <- lm(guinness ~ st.patricks.day, data = ts_both)
summary(reg1)
Which gives the following summary:
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      9.770717   0.071767  136.15   <2e-16 ***
st.patricks.day  0.111599   0.006021   18.54   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.736 on 635 degrees of freedom
Multiple R-squared: 0.3511, Adjusted R-squared: 0.3501
Here we’ve got a very low p-value, suggesting that there is a statistically significant relationship between the popularity of searches for ‘St Patrick’s Day’ and ‘Guinness’. However, the adjusted R-squared is low: the regression on ‘St Patrick’s Day’ explains only around 35% of the variation in ‘Guinness’.
To see this, let’s draw a scatter plot of the two variables with the fitted regression line through it:
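If you would rather pull these numbers out of the model than read them off the printed summary, something like the sketch below works. The ts_both data frame isn't defined in this post, so I'm assuming it simply has one column per search term; the demo_both data frame here is made-up data with the same column names, not the real search data.

```r
# Hypothetical data frame with the same column names as ts_both (made-up values).
set.seed(7)
n <- 200
st.patricks.day <- rexp(n, rate = 0.2)   # skewed: mostly low, with occasional spikes
guinness <- 10 + 0.1 * st.patricks.day + rnorm(n, sd = 1.5)
demo_both <- data.frame(guinness, st.patricks.day)

fit <- lm(guinness ~ st.patricks.day, data = demo_both)
fit_summary <- summary(fit)

coef(fit_summary)["st.patricks.day", "Estimate"]   # slope
coef(fit_summary)["st.patricks.day", "Pr(>|t|)"]   # p-value for the slope
fit_summary$adj.r.squared                          # adjusted R-squared
```

The p-value and the R-squared answer different questions: the first asks whether the slope is distinguishable from zero, the second how much of the variation the line accounts for. That is why a tiny p-value can sit alongside a modest R-squared.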
plot(ts_both$st.patricks.day, ts_both$guinness, pch = 18, col = "gray", ylab = "Guinness Search Interest", xlab = "St Patrick's Day Search Interest", main = "Relative Search Interest: Guinness and St. Paddy's")
# add the fitted regression line on top of the scatter plot
abline(lm(guinness ~ st.patricks.day, data = ts_both), col = "darkred", lwd = 2)
Looking at this, if we know the number of searches for ‘St Patrick’s Day’, can we be fairly confident in the number of searches for ‘Guinness’? No. But that doesn’t mean we can’t examine the relationship further.
Let’s go deeper into time series and split up the variables into various components.
We can decompose the time series into its seasonality, general trend, and random component using the decompose() function as follows:
components_paddys <- decompose(relative_search_interest[,1])
components_guinness <- decompose(relative_search_interest[,2])
And produce the following plots:
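For reference, calling plot() on the object returned by decompose() draws the observed, trend, seasonal and random panels in one go. Here is a minimal sketch on a made-up weekly series (demo_series and components_demo are hypothetical names, and the data is synthetic):

```r
# Synthetic weekly series: gentle upward trend + yearly cycle + noise.
set.seed(3)
weeks <- 1:(52 * 5)
demo_series <- ts(
  20 + 0.02 * weeks + 10 * sin(2 * pi * weeks / 52) + rnorm(length(weeks)),
  start = 2004, frequency = 52
)

# Additive decomposition (the default) into trend, seasonal and random parts.
components_demo <- decompose(demo_series)

# Four stacked panels: observed, trend, seasonal, random.
plot(components_demo)
```

Note that the seasonal component from decompose() is the same yearly pattern repeated for every year in the series, so it shows when the peaks fall rather than how they change from year to year.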
Now let’s use the seasonality components to produce another overlaid time series:
ts.plot(components_paddys$seasonal, components_guinness$seasonal, gpars = list(col = c("red", "blue"), ylab = "Seasonality", main = "Seasonality in Search Term Popularity"))
What can we say about this?
Both of these series have seasonal peaks that seem to occur at the same time each year. Perhaps St Patrick’s Day piques people’s general interest in Guinness.
It seems sensible then to say that, when there is a spike in the popularity of ‘St Patrick’s Day’, there is likely to also be a spike in the popularity of ‘Guinness’.
More importantly, the plot.ts() and ts.plot() functions are useful for producing basic graphical representations of time series data, and I hope you find them as useful as I do.
After all that I feel it’s time for a drink. I just can’t work out what to have…