Analysing Website Data with Graph Databases


Knowing how your potential customers and users are navigating through your website is a valuable insight.

You might focus on how long the average user spends on your website, where they go, where they are in the world, how often they visit, where they arrive from, and an almost endless list of other variables.

You might discover which products are selling the most successfully (and which aren’t), or which of your blog posts are the most engaging (and which ones are being ignored). You might even find bugs or inefficiencies within your website that you didn’t know about before.

What is a graph database?

Click here to try out the live D3 example. Hover over the circles to see the names of the example web pages that we will go through later in the article, and hover over the lines to see how many clicks occurred between those two pages.

Even though graph databases are ‘non-relational’ in nature, they are particularly useful for storing and analysing data which have inherent relationships with each other. When you see dashboards on websites containing recommendations that read ‘You might like these films…’, there is probably a graph database in the back end powering them.

It knows you like film a and that you went to see film b at the cinema recently with your friend z. It might also know that friend z likes a and b, but also went with his friend p to see film c, so why not try c?

By the way, ‘Do you know friend p?’

Neo4j and web analytics

To analyse our website visit data we can use Neo4j, a fairly simple but versatile online graph database which allows you to import data, analyse it and produce some cool visualisations for free.

With Neo4j and all graph databases, you can store your data in three main ways:

  • Nodes – these will be your webpages, or possibly entities like people, products, locations etc.
  • Relationships – something to connect two nodes, like a web visit. Between pages A and B, visit C occurred. Neo4j stores every relationship (or edge) with a direction, but a query can ignore that direction, so a relationship can also be read both ways: Andy <- is related to -> Bill.
  • Properties – you can assign extra information to your web pages or your visits. For example, pages could have properties of name = ‘Home’ and url = ‘www.home.com’, while visits can contain the date and times, the IP addresses and the locations of the people who made them.
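To make the three building blocks concrete, here is a minimal sketch in plain Python (not Neo4j code). The class names, the second URL and the date and city values are made up for illustration; only the Home page properties come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node: a label plus a dictionary of properties."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed relationship between two nodes, with its own properties."""
    start: Node
    end: Node
    rel_type: str
    properties: dict = field(default_factory=dict)


def describe(rel):
    """Render a relationship in a Cypher-like arrow notation."""
    return (f"({rel.start.properties['name']})"
            f"-[:{rel.rel_type}]->"
            f"({rel.end.properties['name']})")


# Two pages (nodes) and one visit between them (relationship).
home = Node("Page", {"name": "Home", "url": "www.home.com"})
team = Node("Page", {"name": "Team", "url": "www.home.com/team"})  # hypothetical URL
visit = Relationship(home, team, "CLICKED_TO",
                     {"date": "2016-03-01", "city": "Dundee"})  # invented values

print(describe(visit))  # prints (Home)-[:CLICKED_TO]->(Team)
```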

Cypher

The main challenge is to import all of this data from a relational database into a format that Neo4j can understand. Neo4j uses a declarative query language called Cypher; if you are experienced in writing SQL, you will notice a lot of similarities between the two. It is very easy to learn and to write, so don’t be frightened to try it yourself (even if you are not a technical person, it is quite reasonable to learn from example queries alone!).

Here is an example query:

The word MATCH replaces the word SELECT for SQL users, to let Neo4j know that you are looking for a pattern in your database rather than specific columns. You might give that pattern an alias such as a to make your querying easier (so you can count how often a pattern occurs in your data).

The WHERE clause narrows the results down to patterns with the specific relationships that are interesting to you…

…and the RETURN statement will work to output either a list of results, or if possible, an animated visualisation like this:
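A query with the MATCH, WHERE and RETURN shape described above might look like the following sketch. It is written as a small Python helper that only builds the Cypher text, so it can be pasted into Neo4j’s query window. The Page label and CLICKED_TO relationship type are the ones used later in this article; the property names (name, rank) and the helper itself are assumptions.

```python
def build_click_query(rank=None):
    """Return Cypher text that counts clicks between pairs of pages.

    If rank is given, only clicks at that stage of a visit are counted.
    """
    # MATCH describes the pattern: page a, a click c, page b.
    lines = ["MATCH (a:Page)-[c:CLICKED_TO]->(b:Page)"]
    if rank is not None:
        # WHERE keeps only the clicks we are interested in.
        lines.append(f"WHERE c.rank = {int(rank)}")
    # RETURN lists what comes back; count(c) groups by the two page names.
    lines.append("RETURN a.name AS from_page, b.name AS to_page, "
                 "count(c) AS clicks")
    lines.append("ORDER BY clicks DESC")
    return "\n".join(lines)


print(build_click_query(rank=1))
```

Calling build_click_query() without a rank leaves out the WHERE line and counts every click.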

The data

What your website data looks like will depend on which tool you use to record your click data, but this is a simplified Excel version of what a ‘visit’ might look like. The columns are fairly self-explanatory; however, we can generate a rank column that gives us some extra information about the stage of the visit at which each click occurred. For example, a rank of 1 = the first click.

Remember that Neo4j can store data in the form of nodes, relationships and properties, so you will want to spend some time deciding what to store as a node, what as a relationship and what as a property.
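One way the rank column could be generated is sketched below: group the clicks by visit, sort each group by time and number the clicks from 1. The column names (visit_id, time, from_page, to_page) and the sample rows are invented for illustration, since the real layout depends on your own click data.

```python
from collections import defaultdict
from datetime import datetime

# Invented sample rows; real data will have its own columns.
clicks = [
    {"visit_id": 1, "time": "2016-03-01 09:02:10",
     "from_page": "What We Do", "to_page": "Team"},
    {"visit_id": 1, "time": "2016-03-01 09:00:05",
     "from_page": "Home", "to_page": "What We Do"},
    {"visit_id": 2, "time": "2016-03-01 10:15:00",
     "from_page": "Home", "to_page": "Platform"},
]


def add_rank(rows):
    """Return copies of the rows with a rank column (1 = first click of a visit)."""
    by_visit = defaultdict(list)
    for row in rows:
        by_visit[row["visit_id"]].append(row)

    ranked = []
    for visit_id in sorted(by_visit):
        # Order the clicks of one visit by their timestamp.
        ordered = sorted(
            by_visit[visit_id],
            key=lambda r: datetime.strptime(r["time"], "%Y-%m-%d %H:%M:%S"),
        )
        for position, row in enumerate(ordered, start=1):
            ranked.append({**row, "rank": position})
    return ranked


ranked = add_rank(clicks)
for row in ranked:
    print(row["visit_id"], row["rank"], row["from_page"], "->", row["to_page"])
```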

Here we have some fake webstats data with 13 pages and roughly 550 clicks over the space of a month, so we will store our webpages as nodes and our click data as relationship properties. We are not really using a graph database as it was originally intended, but all will be revealed shortly.

To get your data in, you will need to tell Neo4j, in separate statements, what your web pages are (along with any important properties that they may have) and what clicks occurred between them.

You can see that I have written separate code for the nodes, as (:Page), and for the relationships that occur between those pages, as [:CLICKED_TO]. With a solid stored procedure in SQL, you can convert these clicks into multiple Cypher strings that can be fed into Neo4j’s query window.
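The conversion from rows to Cypher strings can be sketched as follows. This uses Python rather than the SQL stored procedure mentioned above, and the helper names, property names and sample values are assumptions; only the (:Page) label and [:CLICKED_TO] relationship type come from this article.

```python
def cypher_literal(value):
    """Format a Python value as a Cypher literal (numbers, booleans, strings)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes and single quotes inside string values.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def cypher_props(properties):
    """Format a dict as a Cypher property map, e.g. {name: 'Home'}."""
    body = ", ".join(f"{key}: {cypher_literal(value)}"
                     for key, value in properties.items())
    return "{" + body + "}"


def page_statement(page):
    """One statement per web page; MERGE avoids creating duplicates."""
    return f"MERGE (:Page {cypher_props(page)})"


def click_statement(click):
    """One statement per click: find both pages, then link them."""
    rel_props = {k: v for k, v in click.items()
                 if k not in ("from_page", "to_page")}
    return (
        f"MATCH (a:Page {{name: {cypher_literal(click['from_page'])}}}), "
        f"(b:Page {{name: {cypher_literal(click['to_page'])}}}) "
        f"CREATE (a)-[:CLICKED_TO {cypher_props(rel_props)}]->(b)"
    )


print(page_statement({"name": "Home", "url": "www.home.com"}))
print(click_statement({"from_page": "Home", "to_page": "Team",
                       "rank": 1, "city": "Dundee"}))
```

Run the page statements before the click statements, so that each MATCH can find the pages it needs to connect.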

Providing all of your code is error free, the end result is a bit crazy, but also really cool. In fact, to me it is eerily similar to a network of neurons:

Prune away the less useful clicks, such as the ‘Page not found’ instances or users refreshing the page, and you will start to see some order in the chaos. A bit of careful dragging and dropping of your nodes will get you started. It seems here that the majority of clicks are occurring between the Home page, the What We Do page, the Team page, and the Platform page.

Of course, a real analyst can check and confirm this using some Cypher queries (I have exported the data tables to Excel for presentation purposes).
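The pruning step could also be done before the data ever reaches Neo4j. The sketch below assumes that a refresh shows up as a click from a page to itself and that error pages are named ‘Page not found’; the column names and sample rows are invented.

```python
def prune(clicks, ignored_pages=("Page not found",)):
    """Drop refreshes (a page clicking to itself) and clicks touching ignored pages."""
    kept = []
    for click in clicks:
        if click["from_page"] == click["to_page"]:
            continue  # treated as a refresh
        if (click["from_page"] in ignored_pages
                or click["to_page"] in ignored_pages):
            continue  # error page on either side of the click
        kept.append(click)
    return kept


# Invented sample rows.
clicks = [
    {"from_page": "Home", "to_page": "Team"},
    {"from_page": "Home", "to_page": "Home"},
    {"from_page": "Team", "to_page": "Page not found"},
    {"from_page": "Page not found", "to_page": "Home"},
    {"from_page": "Team", "to_page": "Platform"},
]

for click in prune(clicks):
    print(click["from_page"], "->", click["to_page"])
```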

‘What were the four most frequently visited pages on the website?’

‘First two clicks?’

Or ‘how do customers arrive at the checkout?’
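For a feel of what these three questions compute, here is a rough equivalent in plain Python over a handful of ranked clicks. It is not the Cypher used for the results above; the column names, the Checkout page and the sample rows are all invented.

```python
from collections import Counter

# Invented sample rows with a rank column (1 = first click of a visit).
clicks = [
    {"visit_id": 1, "rank": 1, "from_page": "Home", "to_page": "What We Do"},
    {"visit_id": 1, "rank": 2, "from_page": "What We Do", "to_page": "Team"},
    {"visit_id": 2, "rank": 1, "from_page": "Home", "to_page": "Platform"},
    {"visit_id": 2, "rank": 2, "from_page": "Platform", "to_page": "Checkout"},
    {"visit_id": 3, "rank": 1, "from_page": "Home", "to_page": "What We Do"},
    {"visit_id": 3, "rank": 2, "from_page": "What We Do", "to_page": "Checkout"},
]


def most_visited(rows, n=4):
    """Pages that clicks land on most often, with their counts."""
    return Counter(r["to_page"] for r in rows).most_common(n)


def first_clicks(rows, depth=2):
    """How often each (rank, from, to) step occurs among the first clicks."""
    return Counter((r["rank"], r["from_page"], r["to_page"])
                   for r in rows if r["rank"] <= depth)


def arrivals_at(rows, page):
    """Which pages visitors were on just before reaching the given page."""
    return Counter(r["from_page"] for r in rows if r["to_page"] == page)


print(most_visited(clicks))
print(first_clicks(clicks).most_common(3))
print(arrivals_at(clicks, "Checkout"))
```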

You can even filter your results by any of the properties that you coded in originally, such as city, location, time, browser history, or whatever other data you have available.

Data science and Consolidata

There are far more rigorous analyses you can do alongside these. The Neo4j community edition and the Cypher query language are great for providing a snapshot of the flow of traffic on a website, but they are not without their limitations. Standard analyses such as the frequency of each page visit, analysis by geographical location, gender analysis etc. might best be done in SQL or even in RStudio, which makes drilling down into your data faster and easier in some cases. You can also only analyse fairly small amounts of data without upgrading.

You might also find that building a profile of your customer based on their browsing habits is a more valuable exercise. Placing customers into clusters based on which pages they spend the most time on can be used to predict where a user is likely to go next during a visit to the website, so you can target specific content more effectively. Our co-founder Gordon Meyer will be publishing a blog post discussing the cluster analyses he performed as part of his Business Intelligence M.Sc. at the University of Dundee, using real data from the Financial Times website. Read more about Gordon and the rest of the team here.

Or you might want a cleaner or more functional visualisation to take away to your marketing and IT teams. Neo4j gives you the option to export your query results as JSON files, which can be visualised using a JavaScript library.

Here is a snapshot of a D3.js example I built from a small section of the website click data (try it out here). I have mapped the width of the lines to data values so that they are proportional to the number of clicks that occurred between those two nodes (so thicker lines mean more traffic) and assigned different shades of blue according to the level of the page in the website hierarchy. You don’t have to use D3.js either; there are a number of other visualisation libraries you can go for, like Alchemy.js.
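The width mapping can be sketched independently of D3.js. The helper below turns click counts per pair of pages into a nodes-and-links JSON structure of the kind commonly fed to D3 force-directed examples, with a line width scaled linearly between a minimum and a maximum. The function name, the width range and the click counts are assumptions, not the values used in the live example.

```python
import json


def to_d3_graph(edge_counts, min_width=1.0, max_width=10.0):
    """Convert {(from_page, to_page): clicks} into a nodes/links dictionary."""
    if not edge_counts:
        return {"nodes": [], "links": []}

    pages = sorted({page for pair in edge_counts for page in pair})
    index = {name: i for i, name in enumerate(pages)}
    biggest = max(edge_counts.values())

    links = []
    for (source, target), count in edge_counts.items():
        # Scale linearly so the busiest link gets max_width.
        width = min_width + (max_width - min_width) * count / biggest
        links.append({
            "source": index[source],  # links refer to nodes by position
            "target": index[target],
            "clicks": count,
            "width": round(width, 2),
        })
    return {"nodes": [{"name": page} for page in pages], "links": links}


# Invented click counts between pages.
edge_counts = {
    ("Home", "What We Do"): 40,
    ("Home", "Team"): 20,
    ("What We Do", "Platform"): 10,
}

graph = to_d3_graph(edge_counts)
print(json.dumps(graph, indent=2))
```

A square-root or logarithmic scale may read better than a linear one when a few links carry most of the traffic.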

Graph databases are scalable, fast for storing and analysing associative data, and have a lot of potential for some really extravagant and professional-looking visualisations. And you don’t need to be a technical genius to do some simple but effective things with them.
