Recently, Consolidata were approached by a fast-growing internet company that acts as a 'digital wallet', letting customers send money to other users' phones in the UK and abroad. The transactions were carried out by local merchants (small shops and newsagents) in areas with a high proportion of immigrant residents. The company's aims were to:
- Identify which merchants were performing the best (or worst) in volumes of sales and the value of sales.
- Map the locations of these merchants and identify any external factors that might be driving (or hindering) their success.
- Predict potential new areas of business in order to better focus marketing resources.
Consolidata's role was to produce a demonstration of this, without any real company data to work from. Something I have learnt since venturing into big data is that this is more common than I expected. Data is valuable, after all. Ultimately, this required a more creative approach than simply analyzing some obscure raw data source.
The data I settled on is a little unusual – in fact, I decided to use UK land registry data (specifically the price paid data) from the gov.uk website. Because the price paid data contains location, price, categorical and date fields, it is perfect for our demo purposes: with a bit of sampling, relabeling and preparation, we can create a sample that looks and behaves like transaction data without having to manipulate the data too drastically.
Understanding that London is one of the most diverse cities in the world, and contains areas with high proportions of immigrants and/or ethnic minorities, we decided to narrow our focus to transactions that occurred in London. After careful relabeling, we ended up with a data set that contained:
- Transaction value
- Transaction costs
- Gender information
- Geographic data
- Whether the transactions involved sending money abroad or not.
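The sampling-and-relabeling step could be sketched roughly as below. This is an illustrative Python sketch, not the actual preparation pipeline: the field names, the scaling factor, the 5% fee, and the randomly assigned demographics are all assumptions made up for the demo.

```python
import random

# Hypothetical illustration: turn price-paid-style rows into
# transaction-like rows by rescaling and relabeling. The scaling
# factor and fee rate are invented for the sketch.
random.seed(42)

price_paid = [
    {"price": 250_000, "postcode": "E1 6AN", "date": "2015-03-14"},
    {"price": 410_000, "postcode": "SW9 8LA", "date": "2015-06-02"},
    {"price": 185_000, "postcode": "N15 4QD", "date": "2015-09-21"},
]

def to_transaction(row):
    # Scale the house price down to a plausible remittance value
    # and attach randomly assigned demo attributes.
    return {
        "value": round(row["price"] / 1000, 2),        # e.g. £250 rather than £250,000
        "cost": round(row["price"] / 1000 * 0.05, 2),  # assume a 5% fee
        "postcode": row["postcode"],
        "date": row["date"],
        "gender": random.choice(["M", "F"]),
        "abroad": random.choice([True, False]),
    }

transactions = [to_transaction(r) for r in price_paid]
for t in transactions:
    print(t)
```

Because the geography, dates and values come from real records, the shape of the data stays realistic even though the labels are synthetic.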
Although the above chart was formatted in Excel, I used a fairly wide range of technologies for the initial data analysis:
- R and RStudio
For our case study, I used some basic R functions to get a quick summary of the data and produce some simple tables of counts. When analyzing the date data, I was able to quickly produce simple breakdowns of total transactions by day of the month, as well as by month, and produce very clean visuals from them.
Above: Some simple outputs from R, converted into Excel tables and plots for show. I also used the summary function to get an overview of the average value, cost and revenue gained per transaction, the standard deviations and what the min and max transaction values were, as seen below.
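The day-of-month and month breakdowns described above could be sketched in a few lines of plain Python (the original work used R; the dates below are invented for illustration):

```python
from collections import Counter
from datetime import date

# Illustrative sketch, not the original R code: count transactions
# by day of month and by calendar month. Sample dates are made up.
dates = [date(2015, m, d) for m, d in
         [(1, 5), (1, 17), (2, 5), (3, 28), (3, 5), (6, 17)]]

by_day = Counter(d.day for d in dates)      # e.g. how many fell on the 5th
by_month = Counter(d.month for d in dates)  # e.g. how many fell in March

print(by_day.most_common())
print(by_month.most_common())
```

The same counts fall out of R's `table()` in one call; the point is only that the breakdown itself is a simple frequency count over extracted date parts.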
Things get a bit more interesting when you start to use R and RStudio for their packages and predictive power. Using a combination of libraries that can build decision trees and render them graphically, you can create a predictive model that identifies which factors are most significant in predicting whether a transaction will send money abroad. This can provide a base for machine learning: by refining the model over time, we become more accurate at identifying the significant predictive factors and the extent to which they predict an outcome. The output can resemble the decision tree shown below.
In our scenario, the biggest predictor of whether a transaction sent money abroad was the time of year the sale occurred: if it occurred after mid-June, it had an 86% chance of sending money abroad. Sadly, nothing particularly interesting came from this initial prediction, but it provides a starting point for better and more powerful algorithms to be built upon. Algorithms define action, so any significant factors that do exist (e.g. whether the customer is male or female, where they live, their nationality) will drive the next move for targeting new customers.
```r
library(rpart)  # the main library for forming decision trees.
# Also use rattle, rpart.plot and RColorBrewer for something fancier.

str(data)      # get an overview of what the data is
summary(data)  # get some basic summaries. I also use describe.by() from the psych library.
table(data$columnA, data$columnB)  # a quick, easy-to-format table of counts from two variables
```
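As a rough cross-check of the single-split idea, the same "abroad if month > threshold" rule can be recovered in plain Python without any tree library. This is a toy sketch with invented data, not the rpart model from the case study:

```python
# Search for the month threshold that best separates transactions that
# send money abroad from those that do not -- a hand-rolled version of
# the first split a decision tree would make. Toy data, invented here.
data = [
    # (month of transaction, sent abroad?)
    (1, False), (2, False), (3, False), (4, False), (5, False),
    (6, False), (7, True), (8, True), (9, True), (10, True),
    (11, True), (12, False),
]

def best_month_split(rows):
    """Return the threshold t maximising accuracy of 'abroad if month > t'."""
    best_t, best_acc = None, -1.0
    for t in range(1, 12):
        correct = sum((month > t) == abroad for month, abroad in rows)
        acc = correct / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

t, acc = best_month_split(data)
print(f"abroad if month > {t} (accuracy {acc:.0%})")
```

A real tree simply repeats this kind of search recursively over every candidate variable, which is why it surfaces the most predictive factor first.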
SQL Server 2014
For more detailed and deeper insights, SQL Server rarely fails to deliver. By importing the CSVs into SQL Server and performing a bit of data cleansing, I was able to run some basic queries and get what I needed fairly quickly: breakdowns by gender, insights into where money was being sent, and the average value per transaction per location. While this could have been done in R as well, I opted for T-SQL since it was quicker for me to write several very similar SQL queries; the R alternative would have meant finding the correct functions and packages for each job and formatting the output into something user friendly.
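The flavour of those T-SQL aggregations can be sketched with SQLite so the example is self-contained and runnable. Table and column names here are invented, not the actual schema:

```python
import sqlite3

# Self-contained sketch of the kind of grouped aggregation described:
# average value and gender breakdown per location. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        location TEXT, gender TEXT, value REAL, abroad INTEGER
    )""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        ("Hackney", "F", 120.0, 1),
        ("Hackney", "M", 80.0, 0),
        ("Brixton", "F", 200.0, 1),
        ("Brixton", "M", 100.0, 1),
    ],
)

rows = conn.execute("""
    SELECT location,
           COUNT(*)                               AS n,
           AVG(value)                             AS avg_value,
           SUM(CASE WHEN gender = 'F' THEN 1 END) AS n_female
    FROM transactions
    GROUP BY location
    ORDER BY location
""").fetchall()
for row in rows:
    print(row)
```

The same `GROUP BY` with conditional aggregation pattern covers all three breakdowns in the paragraph above; only the grouped column changes.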
Above: using an Excel stacked bar chart illustrates nicely the differences between London locations in terms of transaction volume, revenue and costs.
Below: how these values changed over the course of the year, plus how the number of transactions sending money abroad changed over the year.
I have begun to hear more and more positive things about Power BI 2.0. It has been improving steadily in response to user requests, adding features such as the ability to run R scripts. For our example, Power BI came into its own for its ease of use, especially for first-time users: you can map a number of factors onto ready-made visual templates. The standard bar, line, scatter and pie charts are all there, but Power BI also lets you map your data onto Bing maps.
In a separate exercise, I wrote a web crawler that scraped Western Union merchants' postcodes from the Western Union website, producing a list of around 300 merchants across London. After assigning some random data to each merchant to build a table, I began to build a dashboard. In the visual below:
- The size of the circle represents the number of transactions that occurred at that merchant: the larger the circle, the more sales. However, size alone does not give a complete picture of merchant success.
- The saturation of each circle represents the ratio of transaction value to transaction cost: darker circles represent a higher ratio.
Above: The larger, darker circles indicate the most successful merchants. Small dark circles represent merchants who process valuable transactions, but in smaller numbers. Larger, lighter circles belong to merchants that process a lot of transactions, but where either costs are too high or the value of the transactions is too low.
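The postcode-scraping step could be sketched as a small HTML parser. The markup below is hypothetical (the real Western Union pages will differ, and a real crawler would also fetch the pages and respect the site's terms of use); the point is just extracting UK-format postcodes from listing text:

```python
import re
from html.parser import HTMLParser

# Loose UK postcode pattern -- good enough for a sketch, not exhaustive.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

class PostcodeExtractor(HTMLParser):
    """Collect postcode-shaped strings from the text of an HTML page."""
    def __init__(self):
        super().__init__()
        self.postcodes = []

    def handle_data(self, data):
        self.postcodes.extend(POSTCODE_RE.findall(data))

# Hypothetical merchant-listing markup, invented for the example.
html = """
<div class="merchant">News &amp; Wine, 12 High St, London N4 2AB</div>
<div class="merchant">QuickSend, 3 Market Rd, London SW9 7DE</div>
"""

parser = PostcodeExtractor()
parser.feed(html)
print(parser.postcodes)
```

Each extracted postcode can then be geocoded and joined to the randomly assigned transaction table to drive the map visual.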
The dashboard is not static, either – Power BI lets you interact with these circles and hover over them for any further information you may have stored about a merchant: who owns it, the vendor's contact details, or when the merchant opened. You might even highlight areas on the same chart where specific ethnic groups or demographics are more concentrated and identify potential new areas for setting up merchants.
How might one do that?
Using global remittance data to identify where money sent from the UK is likely to go, you can discover which customer bases are the best targets for future sales. If the most money is being sent to Kazakhstan, for example, then it makes sense to find densely populated areas of London where the most Kazakhs are located and set up a merchant there. You can join different data sets, as I did, to calculate the relative proportion of transfers from the UK to a given country and compare it with the global average.
Above: Creating a new metric that can predict potential new customer demographics to target.
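One way to build such a metric is below: the UK's share of outbound remittances to each country, divided by that country's share of global remittance flows, so values above 1 suggest a UK-specific concentration. This is a sketch of the general idea, and all figures are invented for illustration:

```python
# Invented figures, in arbitrary currency units, purely for illustration.
uk_outflows = {"Kazakhstan": 50.0, "India": 400.0, "Poland": 550.0}
global_flows = {"Kazakhstan": 2_000.0, "India": 70_000.0, "Poland": 10_000.0}

uk_total = sum(uk_outflows.values())
global_total = sum(global_flows.values())

def concentration(country):
    """UK share of outbound remittances / country's share of global flows.
    > 1 means the UK sends disproportionately much to this country."""
    uk_share = uk_outflows[country] / uk_total
    global_share = global_flows[country] / global_total
    return uk_share / global_share

for c in uk_outflows:
    print(c, round(concentration(c), 2))
```

Countries scoring well above 1 are candidates for the demographic-targeted merchant placement described above.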
Big data is not valuable on its own. The value you can obtain from it depends on:
- The tools available on the platform and the skills to use them.
- The processes the platform runs to handle the data.
- The quality of the algorithms driving future business decisions.
The best data platforms don't just build new digital processes on top of existing business processes, making sense of the backlog of historical data – they also build algorithms for the future: algorithms that are flexible, change with experience, and generate internal and external predictions. Algorithms define action, and a combination of technologies will be needed to develop them.
These approaches to data processing are the future: using multiple technologies to gain insight and learn from experience. But the kicker? We will start to produce insights that no one even expected in the first place.