Visualising the Rubik's Cube Community with R and Neo4j

#

This is something I have been meaning to revisit and write about for quite a long time. For a while I have been jealous of those in post in this thread. With access to the WCA database, you could find out more or less anything you want about the competitive speedcubers, and my lack of programming knowledge held me back for a long time.

Now that I am lucky enough to have had the chance to experiment with a few different techniques, I thought I would share the method I use for analysing the WCA export data. From what I’ve seen, the majority of people posting in that thread use either SQL, Python or Ruby to do the bulk of their analysis.

R

To prepare the data I use R over SQL…mostly. The script I use to get the data into data frames is really quite simplistic and looks something like this:

# set your working directory to where your WCA export files are. 
setwd('D:\\wca')
files <- dir('D:\\wca', pattern = '*.tsv')

# Loop through all the TSV files in my working directory and create data frames, using substrings of the file names as variable names.
for (a in files){
 assign(gsub('.tsv','', substr(a, 12, 100)), read.delim(a, header = T, sep = '\t'))
} 

R is particularly good for getting some simple descriptive statistics and plots as soon as you have created your data frames. For example, simple counts:

# number of competitions by year
data.frame(table(Competitions$year))
# the relative proportions of speedcubers by country and how to order them.
df <- data.frame(prop.table(table(Competitions$countryId))) 
df[order(-df$Freq)]
# merging files together, and filtering them by criteria.
df <- merge(Competitions,Results,by.x = 'id', by.y = 'competitionId') # merging data
data.frame(df$year,df$month,df$day,df$best)[df$regionalSingleRecord == "WR" & df$eventId == "333bf",] # filtering data

And some decent plots, without too much effort:

# Persons belonging to the top 6 countries by number of competitors:
library(sqldf)
df <- sqldf("SELECT * FROM Persons WHERE countryId IN('USA','China','India','Brazil','Poland','Canada')")
# show proportions of males vs females vs not identified per country. 
with (df, table(gender, countryId))
ggplot(df, aes(x = countryId, fill = gender)) + geom_bar()

ggplot

As you can see in the example, I even used SQL to write some queries in R!

Neo4j

The advantage that Neo4j has over these is the ease of modelling of relationships between entities, like competitors and the competitions they’ve been to, and how they might be related to other speedcubers. The language used (Cypher) is nice to work with – it is an expressive language and has a number of similarities with SQL in terms of its syntax and the way it is evaluated. For example, to get an aggregation of the number of competitors by year:

MATCH (a:Cuber)-[r:COMPETED_AT]-(b:Competition)
RETURN b.name as Competition, COUNT(a) as `Number of Competitors`
ORDER BY `Number of Competitors` DESC
LIMIT 7;

The MATCH clause tells Neo4j to look for a pattern in the database. In this case, that pattern is consists of two nodes, one with a Competition label and the other has a Cuber label, joined by a COMPETED_AT relationship. The database is scanned for joins where this pattern occurs, and from these matches the Competition name (b.name) and a COUNT of the number of competitors attached to each node, is returned.

This returns output like so:

CompetitionNumber of Competitors
World Rubik’s Cube Championship 2013 584
China Championship 2015 514
US Nationals 2014 499
US Nationals 2015 451
World Rubik’s Cube Championship 2015 430
World Rubik’s Cube Championship 2009 331
World Rubik’s Cube Championship 2011 297

The fun part comes when you want to express something more than just a table or a data frame. For example, this query:

MATCH (n:Cuber)-[r]-(n1)
WHERE n.name = 'Oliver Frost'
RETURN n,n1

Returns a graph of all the competitions I have ever been to:

of

Although, if I want odd/weird statistics then I would not necessarily do all my querying in Neo4j. What I would use it for is to examine the relationships between entities.

For example, what is the shortest distance between me and cuber x, going via other cubers and the competitions we’ve been to? I picked myself (obviously) and a new Chilean speedcuber who went to his first competition this weekend, and ran this query:

MATCH p = shortestPath((a:Cuber{name:'Oliver Frost'})-[*..8]-(b:Cuber{name:'Gabriel Ignacio Fernandez Albarracin'}))
RETURN p

path

  • Oliver Frost COMPETED AT Welcome to Cambridge WITH Christopher Bode
  • Christopher Bode COMPETED AT Valpo 2015 WITH Domingo Vio
  • Domingo Vio COMPETED AT Campeonato Chileno de Verano 2016 WITH Gabriel Ignacio Fernandez Albarracin

Another use for Neo4j is measuring the shortest maximum distance between any two cubers. Using a minuscule cubing community, Andorra, as an example. I ran this code:

MATCH p = (n:Cuber{personId:'2014AMAT01'})-[*..4]-(n1:Cuber{personCountryId:'Andorra'})
WITH collect(p) as paths, max(length(p)) as max
RETURN FILTER (path in paths WHERE length(path) = max) as output

…where it turns out that every Andorran cuber is a maximum of two competitions away from any another Andorran cuber, and it is possible to demonstrate with a visual:

andorra

There are only four Andorran cubers in the whole database. When you show cubers’ nationalities instead of their names or IDs, you can see that all four Andorra are accounted for over two  competitions. Therefore, maximum distance is four nodes, or two competitions.

Unfortunately, this is where I started to run into problems with Neo4j (or possibly my lack of knowledge?) Even after allocating more memory to the Java Virtual Machine on which Neo4j runs, tuning the database, adding indices to my nodes and transferring my database to a Consolidata machine with 60GB RAM available, I still had problems executing larger queries.

However, there is a way.

RNeo4j, visNetwork, iGraph

Here is my amateurish brute force method of finding the shortest distance between myself and any other cuber in the world. I use a Neo4j driver based in R called RNeo4j, and the following code runs a shorest path algorithm against myself and every other speedcuber in the world. After indexing the cuber IDs and running this script:

library(RNeo4j)
setwd('C:\\OF')
file <- dir('C:\\OF', pattern = '*Persons.tsv')
Persons <- read.delim(file, header = T, sep = "\t")
graph <- startGraph("http://localhost:7474/db/data")
collect.lengths <- c()
me <- "2012FROS01"
for (a in Persons$id){
# I used [*..10] to tell Neo4j to check depths of up to 10 nodes. This wasn't necessary...
query <- paste("MATCH p = shortestPath((a:Cuber{personId:'", me ,"'})-[*..10]-(b:Cuber{personId:'", a ,"'}))RETURN length(p) as l", sep = "")
 length <- cypher(graph, query)
 collect.lengths <- c(collect.lengths, length$l)
}

# To give a more meaningful result.
table(collect.lengths/2) 

I returned this result after about 20 or so minutes:

DistanceNumber of competitors at that distance
1 732
2 27318
3 21289
4 386

And this is for every WCA ID in the export. Now we can create a table of every speedcuber and my distance from them, and find how who are the furthest cubers:

length.df <- data.frame(Persons$id[Persons$id != '2012FROS01'],collect.lengths/2)
colnames(length.df) <- c('WCA ID','Distance')

table

Here is one of the paths with a length of 4:

longestpath

 

The thing I love the most is the level of integration between R and Neo4j. There are a number of packages available in R that allows you to connect to a working graph database, query it, write to it and visualize it, all from the R console. RNeo4j is the best (and only?) package for this, created by Neo4j data scientist Nicole White (I was inspired by inspired by this talk), and I use it in order to:

  • Create nodes and relationships
    • Basically, to build a graph databases.
    • I either use the createNode() or createRel() for simple models, or I send a query to the graph database in the form of
LOAD CSV WITH HEADERS FROM "file:/C:/path/to/file.csv/ as x
CREATE (:Node {name: x.NameOnCSV, age:toInt(x.AgeOnCSV)})
  • … and repeat for small-medium volumes of nodes or relationships that I have created on the fly in RStudio.
  • To query data structures.
  • To visualise data:
    • Neo4j makes great visuals for smaller networks of nodes.
    • However, producing an image of anything larger than a few hundred nodes slows the browser right down.
    • Using the visNetwork and iGraph packages in R can provide a compromise.

Using the iGraph and RNeo4j packages, you can produce some PNG image ‘snapshots’ of a graph database. You cannot see every node or interact with the graph, but it can give you an idea of what is going on in your networks.

For example, here’s a snapshot of every UK competitor and the competitions they’ve been to. Sadly, you cannot distinguish between cuber and competitor nodes, but here we go: allukcubers

library(RNeo4j)
library(igraph)
graph <- startGraph("http://localhost:7474/db/data")
query = "MATCH path=(n)-[r]-(m) WHERE n.personCountryId = 'United Kingdom' RETURN n.name as n, m.name as m,count(path) as counts"
data <- cypher(graph, query)
g = graph_from_data_frame(data)
V(g)$size = 3
V(g)$color = "steelblue"
V(g)$label = NA
plot(g,edge.arrow.size = 0.3)

And of every female cuber in the world:

femalecubers

query = "MATCH path=(n)-[r]-(m) WHERE n.personGender = 'f' RETURN n.name as n, m.name as m,count(path) as counts" data <- cypher(graph, query)
g = graph_from_data_frame(data)
V(g)$size = 5
V(g)$color = "orangered"
V(g)$label = NA
plot(g,edge.arrow.size = 0.5)

For me, what is interesting about the female cubers plot is that there are a greater number of isolated cuber->competition clusters (shown in the outer edges). What this suggests is that there are a lot of female cubers going to competitions with very few to no other female cubers, and do not get the chance to network as much with other females. This is just one possible explanation, I’m sure there are many others.

And the whole cubing community visualised:

world

I exported this from RStudio as a PDF (albeit a very large one) which will plot the same graph with SVGs instead, creating a much nicer output.

Let me know if I have made any errors or if you would like to know anything about Neo4j to help me with my maximum lengths query!

Latest from this author