Stay up-to-date with the latest business and accountancy news: Sign up for daily news alerts
Data analytics community
Author: Dimitris Kaskadanis & Polyna Christakou
Published: 22 Sep 2022
In the new era of digitalisation, there are emerging challenges relating to the new technology risks while fraud risk increases, adopting new and innovative methods in line with technological progress. In this environment, audit should develop data analytics skills that are beyond traditional risk monitoring and fraud detection tools to meet stakeholders’ evolving expectations.
Clustering is an unsupervised Machine Learning (ML) algorithm (i.e. an algorithm that learns and improves from experience, without input from users) that looks for patterns in data by dividing it into clusters. These clusters are created such that the points are homogenous within the cluster and heterogenous across clusters. Clustering is commonly used in market segmentation and several areas of marketing analytics as well as in fraud detection.
This is an example of an aggregated table where we have filtered out the null values of the ‘credit limit’ column (one case as shown above):
Power BI offers various clustering graphs, to assist in understanding how the model aggregates similar characteristics into clusters; the one that is most often used is the scatter plot which shows the relationship between two numerical values.
Additional visualisation is offered under the ‘get more visuals’ section – right-click on the three dots
Some useful visualisations include:
Scatter plot (included in default visuals)
Clustering helps us interpret a scatter plot. We selected to plot balance (as an independent variable on the x-axis) and purchases (as a dependent variable on the y-axis). We then opted for ‘Automatically find clusters’, that creates the graph below.
It seems that the clusters are grouped based on the volume of the balance, i.e. the amount left in the account to make purchases.
Clustering graph
Apart from the scatter plot visual that allows clustering and is included in the default visuals of Power BI, other clustering visuals can be found through the ‘get more visuals’ section.
We selected the ‘Clustering’ graph, as one which is Microsoft-developed. This visual uses a well-known k-means clustering algorithm. You can control the algorithm parameters and the visual attributes to suit your needs.
We note that the specific graph (as well as other graphs that run ML algorithms) require the installation of R studio.
In both clustering graphs, Power BI automatically chose to classify data under 3 clusters. However, the user may opt for a predefined number of clusters, following a methodological assessment like the ‘Elbow criterion’ or the ‘Silhouette coefficient method’.
Outliers Detection graph
The outliers are those data points that are away from regular data points. As you may observe the outliers are the same in both charts above and are these data points with higher purchases that are away from the main population and do not form a tight cluster.
We also employed a graph to confirm our understanding of the outliers noted through the clustering process.
The ‘Outliers Detection’ graph – which again you can find under the ‘get more visuals’ section – can be used to confirm understanding of outliers noted through the clustering process. Here, it denotes the blue-coloured data points as the inliers and the red-coloured as the outliers.
Power BI gives the user a good overview of this graph. In particular, in this Custom Visual, we can implement one of five popular detection methods: Z-score, Tukey’s method, Local Outlier Factor – LOF method, Cook’s distance, and by manually defining upper and lower thresholds.
We selected the Cook’s distance as this is the most commonly used diagnostic statistical value.
The Institute of Chartered Accountants in England and Wales, incorporated by Royal Charter RC000246 with registered office at Chartered Accountants’ Hall, Moorgate Place, London EC2R 6EA