Twitterbot in R and Neo4j (Part 3/3)


Welcome back to the conclusion of my mini-series on integrating R and Neo4j to analyse social data.

Over the long weekend, I have been having a serious play around with different methods of storing and analysing the data accumulated from our activities on Twitter, so that Consolidata can continue to use data to drive how it engages on the rest of our social media channels and to find out how users interact with us.

Data Model

In the second part, I showed you how to load the data and start querying it. Since we will be exploring concepts like text mining and semantic analytics, I have created a new data model that allows us to analyse others’ tweets as well as our own. If you would like to run the R code to retrieve your own tweets along with those of your followers and the accounts you follow, you can follow the steps in this GitHub repository.
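For a rough idea of what that code involves, here is a minimal sketch of pulling timelines with the twitteR package. The package choice, credential placeholders and account name below are my own illustration rather than the exact code in the repository.

library(twitteR)
# Authenticate with your own Twitter API credentials (placeholders here).
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
# Grab your own recent tweets and the accounts you follow.
home <- getUser("ConsolidataLtd")
home.tweets <- userTimeline(home, n = 1000)
friends <- home$getFriends()
# Pull a timeline for each account you follow; these can then be loaded into Neo4j.
friend.tweets <- lapply(friends, function(u) userTimeline(u, n = 200))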

[Image: the new data model]

MATCH (t:Tweet)<-[:TWEETED]-(n:Home)-[:FOLLOWS]->(n1:User{screenName:'OFrost'})-[:TWEETED]->(t1:Tweet)
OPTIONAL MATCH (t2:Tweet)<-[:FAVORITED]-(n:Home)
RETURN t,t1,t2,n,n1
LIMIT 1 // a look at the data model I will be using in this part.

From this new model, I was able to analyse data from 61,218 tweets and 467 users including @ConsolidataLtd.

Neo4j Insights

Before we go into the text mining of the tweets, let’s whet our appetites and pull some interesting insights from our Neo4j database.

In our immediate network, who mentions ConsolidataLtd the most?

MATCH (n:User)-[:TWEETED]->(t:Tweet)
WHERE t.text CONTAINS 'Consolidata'
RETURN n.screenName AS User, count(t) AS Mentions 
ORDER BY Mentions DESC 
LIMIT 10
User            Mentions
SaffaInUK       37
dafyddbiffen    27
roger_light     16
TheSurrealFish  14
mmontgomeryBI   14

What percentage of our followers do we follow back?

MATCH (n:Home)-[:FOLLOWS]->(n1)
WITH n1, size((n1)-[:FOLLOWS]->(n)) as fols
RETURN sum(fols)/toFloat(count(n1)) as `Percentage`
Percentage
0.5107296137339056

And what percentage of those who we follow, follow us back?

MATCH (n:Home)<-[:FOLLOWS]-(n1)
WITH n1, size((n1)<-[:FOLLOWS]-(n)) as fols
RETURN sum(fols)/toFloat(count(n1)) as `Percentage`
Percentage
0.28743961352657005

Consolidata’s most favorited tweets?

MATCH (n:Home)-[:TWEETED]->(t:Tweet)
RETURN t.text as Tweet, t.favCount as Favorites
ORDER BY Favorites DESC
LIMIT 7
Favorites  Tweet
6          Fun with animated #graph #database and how powerful they can be, including a live example. #neo4j #d3js http://t.co/5u0xhUZCm1
6          Combining #neuroscience with #bigdata analytics https://t.co/cppgYGXT9f
5          fantastic to hear #entrepreneur #success … http://t.co/caePGNXF1f
5          Good piece on what exactly might be connected to the IOT by Venture Beat http://t.co/wGQp8DiCUK #IoT #datascience #data
4          We are delighted to partner with @DofE in their 60th year, bringing them their next generation of #DataAnalytics systems!
4          Amusement at the start of @InteropLondon with ‘robbie the robot’ http://t.co/MPPm0YqR6o
4          Using #opendata, @OrdnanceSurvey has created a map of Mars! Plus other curious visualisation stories in this post – https://t.co/SopVIHlTQU


Back to R

Let’s create a data frame in R containing only our tweets so we can begin analysing our hashtags and compare our use of hashtags to our network as a whole. I extracted just Consolidata’s tweets, split them using spaces as delimiters, and loaded the packages needed for the rest of the article:

library(RNeo4j)
library(tm)
library(wordcloud)
library(slam)
library(ggplot2)
# Connect to the graph. ---------------------------------------------------
graph <- startGraph("http://localhost:7474/db/data","neo4j","password")
query <- "MATCH (h:Home)-[:TWEETED]->(n:Tweet) RETURN n.text as tweets;"
# Get all of the tweets from your database into a data frame ------------------
tweets <- cypher(graph,query)
# Extract the text from the tweets column in the data frame and
# apply a function that splits each character string, using spaces as the delimiter.
tweets.split <- strsplit(tweets$tweets, " ")
tweets.split.all.words <- unlist(strsplit(tweets$tweets, " "))

What are Consolidata’s most frequently used hashtags?

# Keep only the words that start with '#', count them, and take the 15 most frequent.
hashtags <- grep("^#", tweets.split.all.words, value = TRUE)
hashtags <- data.frame(table(hashtags))
hashtags.top15 <- head(hashtags[order(-hashtags$Freq), ], 15)

[Image: Consolidata’s top 15 hashtags]
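If you would like to draw a similar chart yourself, here is a minimal ggplot2 sketch using the hashtags.top15 data frame built above (the styling is my own choice, not the exact chart shown):

# Bar chart of the top 15 hashtags, ordered by frequency.
ggplot(hashtags.top15, aes(x = reorder(hashtags, -Freq), y = Freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Hashtag", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))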

And how does it compare to the whole of our network?

query <- "MATCH (u:User)-[:TWEETED]->(n:Tweet) RETURN n.text as tweets;"
tweets <- cypher(graph,query)
tweets.split <- strsplit(tweets$tweets," ")
tweets.split.all.words <- unlist(strsplit(tweets$tweets," "))
hashtags <- grep("^#", tweets.split.all.words, value=T)
hashtags <- data.frame(table(hashtags))
hashtags.top15 <- head(hashtags[order(-hashtags$Freq),],15)

[Image: top 15 hashtags across the whole network]

From this, you will notice that Consolidata is not frequently using many of the most common hashtags. Much of our hashtag use was from a single event – the SQL Saturday in Exeter we recently attended. So perhaps with time, the hashtag topics will even out to something a bit more varied than SQL and SQL Server.

You could argue whether using common hashtags is a good tactic for making your tweets more visible to the right people (or even the value of using hashtags in general). But in this case, it highlights potential new topics of conversation for Consolidata to explore like:

  • #marketing
  • #socialmedia
  • #CIO
  • #digital
  • #analytics

Or we could see what other users’ most common hashtags are.

With a bit of R trickery:

# Get all tweets and their user's screen names.
query <- "MATCH (u:User)-[:TWEETED]->(n:Tweet) RETURN u.screenName as User, n.text as Tweet"
tweets <- cypher(graph,query)
# Create a function that expands the data frame so that each hashtag sits in its own row.
hashsplit <- function(x, a = 1, b = 2) {
 expand_row <- function(y) {
  # Split the tweet text on spaces and keep only the words that start with '#'.
  words <- strsplit(toString(y[b]), " ")[[1]]
  tags <- grep("^#", words, value = TRUE)
  # Return one row per hashtag, repeating the user alongside it.
  data.frame(user = rep(y[a], length(tags)), tag = tags,
             stringsAsFactors = FALSE)
 }
 expanded <- apply(x, 1, expand_row)
 df <- do.call("rbind", expanded)
 names(df) <- c(names(x)[a], names(x)[b])
 rownames(df) <- NULL
 return(df)
}

expanded.hashtags <- hashsplit(tweets)
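A quick sanity check (output omitted) confirms that each row is now a single user and hashtag pair:

# Each row of expanded.hashtags should be one user/hashtag occurrence.
head(expanded.hashtags)
nrow(expanded.hashtags)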

I then cheat slightly and use a SQL statement to get a count of each hashtag per user, and then create a pivot table of those hashtags, like so:

library(reshape) # install.packages("reshape")
library(sqldf)   # install.packages("sqldf")
# Output hashtags that have been used more than 5 times to a data frame, then cast that
# data frame into a pivot table, creating a sort of 'hashtag profile' for each user of
# the most common hashtags in our network.
output <- sqldf("SELECT User, Tweet, count(1) as Freq FROM `expanded.hashtags` GROUP BY User, Tweet HAVING count(1) > 5;")
pivot <- cast(output, User ~ Tweet)
df <- data.frame(
 User = pivot$User,
 BigData = pivot$`#BigData`,
 IoT = pivot$`#IoT`,
 DataScience = pivot$`#DataScience`,
 bigdata = pivot$`#bigdata`,
 data = pivot$`#data`,
 MachineLearning = pivot$`#MachineLearning`,
 datascience = pivot$`#datascience`,
 marketing = pivot$`#marketing`,
 digital = pivot$`#digital`)

There are a lot of NA values because I limited the search to hashtags that have been used over 5 times (and because not everyone is using every hashtag). However, this layout means I can very quickly look up a hashtag and see how frequently each user tweets it.

[Image: sample rows of the hashtag pivot table]

Let’s tidy it up and make a heatmap.

# Clean up the NAs and make a nice heatmap with ggplot.
df[is.na(df)] <- 0
df.new <- head(df[order(-df$BigData, -df$data, -df$DataScience), ], 20)
pivot.new <- melt(df.new)
ggplot(pivot.new, aes(x = variable, y = User, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red")

[Image: heatmap of hashtag use by user]

Text Mining in R

Now let’s start looking for patterns in the tweets that others post on Twitter. I combined Consolidata’s 715 tweets with our other connections’ tweets for a combined total of 61,218 tweets. I split them the way I did before to get simple counts, but in this tutorial we will go a bit deeper than just word frequencies. We are going to look for associations between certain terms, and for this we can use the tm library in R.

Here is some code that I used to convert the tweet strings into a readable format for text mining, and then filter out the interesting words:

# Time for some text mining with the tm library. Convert the data into a Corpus,
# and then perform some filtering functions.
tweets <- Corpus(VectorSource(tweets.split))
# Filter out the interesting words. ----
tweets <- tm_map(tweets, removeNumbers)
tweets <- tm_map(tweets, removePunctuation)
tweets <- tm_map(tweets, removeWords, stopwords("english"))
# Wrap tolower in content_transformer so the corpus stays a valid tm corpus.
tweets <- tm_map(tweets, content_transformer(tolower))
tweets <- tm_map(tweets, removeWords, c("rt"))
tweets <- tm_map(tweets, stripWhitespace)
# Stage the data by converting it to a DocumentTermMatrix.
dtm <- DocumentTermMatrix(tweets)

Out of interest, what are some of the most frequently used words?

# Find frequently used words that appear over 1000 times throughout all tweets. 
findFreqTerms(dtm, lowfreq = 1000)
#Output
 [1] "amp" "analytics" "big" "bigdata" "business" "can" "data" "datascience" "day" "free" "get" 
[12] "good" "great" "how" "iot" "just" "like" "london" "need" "new" "now" "one" 
[23] "people" "see" "server" "sql" "thanks" "the" "time" "today" "top" "via" "will"
# Get relative word counts.
colTotals <- col_sums(dtm)
word.count <- data.frame(word = names(colTotals), freq = colTotals)
word.count.top15 <- head(word.count[order(-word.count$freq),],15)

[Image: the top 15 most frequent words]

Now let’s look for some word associations (similar in spirit to basket analysis; findAssocs reports terms whose occurrences correlate with the search term). Using the function (and its arguments):

findAssocs(x = dtm, terms = "word", corlimit = minimum_correlation)

I will find words that are closely associated with some of our network’s most frequently used terms: data, bigdata, analytics, and sql.

# Find associations between certain words from our top 50, to other words.
asso.data <- findAssocs(dtm, "data", corlimit = 0.30)
asso.data
# Output
$data
 big 
0.38
asso.bigdata <- findAssocs(dtm, "bigdata", corlimit = 0.30)
asso.bigdata
# Output
$bigdata
httpstcoxjybdgsa              cos         invested 
            0.32             0.31             0.31 
# It seems that there is a particular link that is strongly associated with the big data term.
asso.ds <- findAssocs(dtm, "datascience", corlimit = 0.30)
asso.ds
# Output
$datascience
numeric(0)
# Sometimes there are no terms that correlate with the search term above the 0.30 limit.
# Oh well, these relationships are weak at best.
asso.sql <- findAssocs(dtm, "sql", corlimit = 0.30)
asso.sql
# Output
$sql
server 
 0.6

Summarise your most frequent hashtags with a word cloud using the wordcloud() function in the wordcloud library (enough wordclouds for you?)
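As a rough sketch (the exact arguments are my own choice, not necessarily what produced the image below), the network-wide hashtag counts built earlier can be fed straight in:

# Word cloud of the network's hashtags, reusing the 'hashtags' frequency table from above.
# min.freq and max.words are illustrative; tweak them to taste.
wordcloud(words = as.character(hashtags$hashtags), freq = hashtags$Freq,
          min.freq = 10, max.words = 100, random.order = FALSE)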

[Image: word cloud of the most frequent hashtags]
