Twitterbot in R and Neo4j - Loading Data (Part 2/3)


Introduction

Revisit part one if you haven't already. You will also need to install Neo4j on your local machine and have the server running in the background so that you can access it from your R console. You can view the full code in this GitHub repository, including a sample CSV file (I will also include snippets here).

Today’s blog will cover:

  • How to connect R and Neo4j using RNeo4j
  • How to extract the data you need from the REST API output and convert it into usable data frames
  • How to use RNeo4j commands to import your data into a graph database and manage it effectively, including avoiding duplicate nodes and improving performance

Extraction

Once you are authenticated, we can start collecting some data. We will use twitteR to pull out the data we want. Today we will focus on getting tweet data and user information, and on encoding some relationships between them. I set up my environment with the following code:

# install.packages("RNeo4j")
# install.packages("twitteR")
# install.packages("httr")
library(twitteR)
library(RNeo4j)
library(httr)
# set working directory.
setwd("path/to/dir/")
# create a connection between R and Neo4j via the RNeo4j package.
graph <- startGraph("http://localhost:7474/db/data", username = "neo4j", password = "neo4j")
# add some indexes and constraints to the graph database before we create nodes, so that we don't duplicate nodes and to speed up our searches later on.
addConstraint(graph, "Tweet", "id")
addConstraint(graph, "User", "screenName")
# get data from the ConsolidataLtd profile.
con <- getUser('ConsolidataLtd')

If you look in your environment (assuming you are using RStudio), you will see that the Consolidata profile information is not yet in a data frame. This is because the data is returned as JSON, and R reads it in as a list, since the data sits at different levels within the JSON structure. Fortunately, you can access this data quite easily, even in list format. For example, try this to get the followers count:

con$followersCount

You can also reference an index inside a list of tweets (in this case, the first list item), and by typing $ you can see what items are available. Try running these commands to see all of the tweet texts in the timeline:

tweets <- userTimeline('ConsolidataLtd', n = 1000, includeRts = F)
lapply(tweets, function(x){x$text})
# Output
[[1]]
[1] "Using #opendata, @OrdnanceSurvey has created a map of Mars! Plus other curious visualisation stories in this post - https://t.co/SopVIHlTQU"
[[2]]
[1] "Pandora's (Apple's) box - https://t.co/kOFEVcfviA #apple #fbi"
[[3]]
[1] "Your Friday fix from /r/dataisbeautiful - https://t.co/oIiipCYecb."
[[4]]
[1] "Good suggestions for #DataScience #Learning at https://t.co/mHuSyskKNR by @Mitch_Crowe"
[[5]]
[1] "The latest findings from @pewresearch -https://t.co/ymOkLyF1yo"
...

If you want everything in this list converted straight into a data frame, simply use:

df <- twListToDF(tweets)


[Image: the tweets converted to a data frame]

Otherwise, by using lapply() we can apply a function to every tweet in that list and extract just the values we need, without having to worry about NULLs or trim away less useful information. The output of lapply() is still a list, so we can nest it inside unlist() to get a vector instead. Once these values are converted to vectors, we can start creating tables and data frames.

I have created an example below with some tweets about the triumphant return of TV’s SuperTed:

users.screenName <- unlist(lapply(tweets, function(x){x$screenName}))
tweets.id <- unlist(lapply(tweets, function(x){x$id}))
tweets.text <- unlist(lapply(tweets, function(x){x$text}))

table(users.screenName) # get a count of the number of tweets per person
st.users <- data.frame(handle = users.screenName) # create a data frame with a custom column name.
st.tweets <- data.frame(id = tweets.id, tweet = tweets.text)

[Image: the st.users and st.tweets data frames]

Getting Our Twitter Data

How do we get more data beyond just tweets? What if we want to find Consolidata's followers, and find out more about them? We can use some of the methods provided by the twitteR package. They are accessible from the con variable like so:

# start building followers and following into your model.
followers <- con$getFollowers()
followers.df <- twListToDF(followers)

I use these additional methods to pull out as much useful information as I want to play with. Here is the code, with comments on the methods I used:

# getFriends() returns the accounts that Consolidata follows.
friends <- con$getFriends()
friends.df <- twListToDF(friends)

# getFavorites() returns tweets that Consolidata has favorited.
favorites <- con$getFavorites(n = 100)
favorites.df <- twListToDF(favorites)

Loading Data Into Neo4j

If you are new to Neo4j and to graph databases, then you can read a little bit about them here. Graphs are useful for modelling relationships between entities and seeing how people are connected. Social data lends itself very nicely to this: in Twitter's case, users can be related in a number of ways. They might follow one another, favorite each other's tweets, or reply to one another via another tweet. We can create something quite structured, with clear relationships between tweets and users, like so:

[Image: graph model showing the relationships between users and tweets]

Try it yourself:

# browse the graph in your R viewer screen.
browse(graph)

We are going to store our data in the form of nodes and relationships.

  • Node = a physical entity, such as a user or a tweet. Nodes are the primary store of data, such as the text a tweet contains.
  • Relationship = a connection between two entities, or an ‘edge’ in graph terms. For instance, the relationship between two users can be User A FOLLOWS User B. Information can be stored on a relationship as well, such as the date a tweet was tweeted.
  • Property = a piece of information that can sit on either a node or a relationship (see the short sketch after this list).
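
As a quick illustration of all three, here is a minimal sketch using RNeo4j's createNode() and createRel() functions (the two users and the since property are hypothetical, purely for demonstration):

# minimal sketch: two hypothetical User nodes, a FOLLOWS relationship
# between them, and a property (since) stored on that relationship.
alice <- createNode(graph, .label = "User", screenName = "alice_example")
bob <- createNode(graph, .label = "User", screenName = "bob_example")
createRel(alice, "FOLLOWS", bob, since = "2016-03-01")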

You can use a range of tools to create your graph. I will use the createNode() function in RNeo4j to create my nodes, and the LOAD CSV clause in Cypher to generate the relationships from a CSV. Here is some example code I used to create a Consolidata node and some tweet nodes, and then to create a CSV containing relationship information, connecting the Consolidata handle to tweet ids.
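
The tweets.nodes data frame referenced below is built in the full script on GitHub; as a rough sketch, it could be assembled from the twListToDF() output like this (the column names simply mirror the properties used in the createNode() call that follows):

# rough sketch: build tweets.nodes from the twListToDF() output. The
# created column is converted to character so it stores cleanly as a
# node property; favCount and rts come from twitteR's favoriteCount
# and retweetCount fields.
tweets.df <- twListToDF(tweets)
tweets.nodes <- data.frame(id = tweets.df$id,
 text = tweets.df$text,
 created = as.character(tweets.df$created),
 favCount = tweets.df$favoriteCount,
 rts = tweets.df$retweetCount,
 stringsAsFactors = FALSE)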

# create a Consolidata node in the graph database, and reuse it later.
Consolidata <- createNode(graph, .label = as.character("Home"), screenName = as.character("ConsolidataLtd"))

# create Tweet nodes with a function
lapply(1:nrow(tweets.nodes), 
 function(x){createNode(graph, .label = as.character("Tweet"), 
 id = tweets.nodes$id[[x]],
 text = tweets.nodes$text[[x]],
 created = tweets.nodes$created[[x]],
 favCount = tweets.nodes$favCount[[x]],
 rts = tweets.nodes$rts[[x]])})

# create relationships between the Consolidata node and tweets using LOAD CSV.
write.table(data.frame(START_ID = tweets.df$screenName, END_ID = tweets.df$id),"path/to/file/consolidata-tweet-rels.csv", sep = "|", row.names = F, col.names = T, quote = F)
cypher(graph, "LOAD CSV WITH HEADERS FROM 'file:/path/to/file/consolidata-tweet-rels.csv' as rels FIELDTERMINATOR '|' MATCH (n:Home {screenName: rels.START_ID}) MATCH (n1:Tweet {id: rels.END_ID}) MERGE(n)-[:TWEETED]->(n1)")

Now you will be able to see your newly created tweet nodes inside your Neo4j database by going to the browser. Try running this query in the Neo4j browser console:

# Cypher code - run this in the console in the Neo4j browser.
MATCH (n) RETURN n LIMIT 100

[Image: tweet nodes displayed in the Neo4j browser]
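
If you would rather stay in R, RNeo4j's getNodes() function can run the same query and return the matching nodes as a list (a quick sketch, assuming the graph connection from earlier):

# sketch: fetch up to 100 nodes back into R as a list of node objects.
some.nodes <- getNodes(graph, "MATCH (n) RETURN n LIMIT 100")
length(some.nodes)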

Following the script in my GitHub repository linked at the top, you can eventually link everything together nicely and cleanly. Because you applied constraints at the start, duplicate nodes are not created when, for example, a user both follows and is followed by Consolidata.
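
For instance, here is a rough sketch (assuming the followers.df and friends.df data frames from earlier) using RNeo4j's getOrCreateNode(), which performs a MERGE against the constrained screenName property instead of blindly creating new nodes:

# rough sketch: create each User node at most once. getOrCreateNode()
# relies on the uniqueness constraint on User/screenName added earlier,
# so an account appearing in both followers.df and friends.df still
# ends up as a single node.
handles <- unique(c(followers.df$screenName, friends.df$screenName))
user.nodes <- lapply(handles, function(h) getOrCreateNode(graph, "User", screenName = h))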

You can start to glean some basic insights, such as: which users does Consolidata engage with the most?

cypher(graph,
"MATCH (a:Home)-[:FAVORITED]->(b:Tweet)<-[:TWEETED]-(c:User)
RETURN c.screenName as User, count(c) as `Number of Tweets Liked`
ORDER BY `Number of Tweets Liked` DESC
LIMIT 10"
)
# Output
              User Number of Tweets Liked
1         JaneMLee                      5
2        davebally                      4
3        pinaldave                      3
4          SqlBrit                      2
5   TheSurrealFish                      2
6       jenstirrup                      2
7          WDayRay                      2
8      simon_sabin                      2
9     dafyddbiffen                      1
10    tonyrogerson                      1


Here is some aggregation to find the average number of tweets that each of your followers has posted over time:

cypher(graph,
"MATCH (a:Home)-[:FOLLOWS]->(b:User)
WITH count(b) as fols, sum(b.statuses) as stat
RETURN stat/fols as `Mean Tweets Per Follower`"
)
# Output
  Mean Tweets Per Follower
1                     2782

Or count words in popular tweets:

# pull the 20 most-favorited tweets out of the graph.
good.tweets <- cypher(graph,
 "MATCH (a:Home)-[:TWEETED]->(b:Tweet)
 RETURN b.id as id, b.text as text, b.favCount as favs
 ORDER BY favs DESC
 LIMIT 20")
# split each tweet into lower-case words and collect them in one vector.
all.words <- c()
lapply(good.tweets$text, function(x){all.words <<- c(all.words, tolower(strsplit(x, " ")[[1]]))})
# tabulate word frequencies.
View(table(all.words))

[Image: table of word counts in the RStudio viewer]

The third part of this series will look into the kinds of analyses you can do in more depth, including how to visualize your graphs.
