With the recent release of the new Qlik Sense September 2020 update, we got some really great new functionality. One of the latest additions that got us most excited was the ability to take advantage of the k-means clustering algorithm. So, we wanted to see if we could bring this capability to QlikView as well. To do this, we had to take advantage of Qlik’s powerful scripting functions, and we thought it’d be awesome to share our experience with you all. So, let’s jump right into the action, shall we?
Implementing K-Means in the Qlik Backend
First things first: we covered what k-means clustering is and presented some interesting use cases for it in our previous article. If you’re not familiar with the algorithm, we strongly advise giving it a quick read-through.
Our implementation went through the following steps:
- Determine the size and diversity of the data at hand
- Form a hypothesis for the optimal number of clusters (K) by looking at the data
- Determine the extremums in the data (the highest and lowest points in our n-dimensional space)
- Place K semi-random initial centroids within the data, one per cluster
- Calculate the distance between each data point and each centroid, and assign every point to its nearest centroid
- Remove the old centroids, place new ones at each cluster’s averaged-out center, and repeat until the centroids stop moving
Now let’s break each of these steps down in a bit more detail.
Size of the data
Generally, it’s considered good practice to be familiar with your data. This way, you’ll be able to set realistic expectations for the outcome of any processing you may want to put it through. In this case especially, it’ll be easier to determine the number of clusters you can group your data into.
Optimal number of clusters
The larger the number of clusters, the higher the chance of “overfitting” your data. Typically, the number of clusters shouldn’t exceed √n, where n is the number of data points in your dataset. If you’re unsure what constitutes an optimal number in your case, you can create a function that determines the “absolute change” in the average distance between individual points and the centroids of the clusters they belong to. That function, however, requires brute-forcing a significant number of scenarios and may be resource-heavy. Once each additional cluster stops improving that average distance by much, you’ve found your optimal number of clusters. This approach is also known as “the elbow rule”.
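Our implementation lives in Qlik script, but the two heuristics above are easy to sketch in plain Python. The function names (`max_clusters`, `elbow_k`) are our own illustrative inventions; `elbow_k` assumes you have already brute-forced the average point-to-centroid distance for each candidate K:

```python
import math

def max_clusters(n_points):
    # Rule of thumb from above: K should not exceed the square root of n.
    return int(math.sqrt(n_points))

def elbow_k(avg_distances):
    # avg_distances[i] = average point-to-centroid distance for K = i + 1.
    # Compute the "absolute change" (the drop) gained by each extra cluster.
    drops = [avg_distances[i] - avg_distances[i + 1]
             for i in range(len(avg_distances) - 1)]
    # The elbow is the K after which the drop falls off most sharply,
    # i.e. where adding more clusters stops paying off.
    best = max(range(1, len(drops)), key=lambda i: drops[i - 1] - drops[i])
    return best + 1  # convert 0-based index to K
```

For example, distances like `[10.0, 4.0, 3.5, 3.3, 3.2]` drop sharply between K=1 and K=2 and barely at all afterwards, so the elbow sits at K=2.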
Extremums in the data
Our example uses 2-dimensional data, which makes it easier to visualize the process of cluster identification; the math doesn’t get much more difficult for n-dimensional data. In this scenario, we need to identify the x-values of the two points at the far right and far left, along with the y-values of the two points at the furthest top and bottom. Combined, these coordinates are sufficient to draw a rectangle that contains all the points we will be clustering.
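In Python terms, the bounding rectangle boils down to four min/max aggregations. A minimal sketch (the helper name `bounding_rectangle` is our own):

```python
def bounding_rectangle(points):
    # points: list of (x, y) tuples.
    # The extremums are the leftmost/rightmost x and lowest/highest y.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Return the rectangle as (lower-left corner, upper-right corner).
    return (min(xs), min(ys)), (max(xs), max(ys))
```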
Initial centroid placement
With the help of a “for” loop, we generate the initial centroids in a semi-random manner and store them in an empty table. For the initial placement, we chose to pick points along one of the diagonals of the rectangle we described in the previous step, spaced at equal distances from one another and kept within the rectangle itself. Doing this decreases the number of iterations the code has to go through before it reaches the final cluster state, essentially saving computational resources. We write all the centroids’ coordinates into a table so we can bring them back later and see how they changed over the course of each iteration.
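A sketch of that placement in Python, assuming the rectangle’s lower-left and upper-right corners are known (the function name `initial_centroids` is hypothetical): K evenly spaced points along the diagonal, none of them touching the rectangle’s corners.

```python
def initial_centroids(lower_left, upper_right, k):
    # Place k centroids at equal intervals along the rectangle's diagonal,
    # strictly inside the rectangle -- a deterministic "semi-random" start
    # that cuts down the iterations needed to converge.
    (x0, y0), (x1, y1) = lower_left, upper_right
    return [(x0 + (x1 - x0) * (i + 1) / (k + 1),
             y0 + (y1 - y0) * (i + 1) / (k + 1))
            for i in range(k)]
```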
Distance calculation
We calculate the Euclidean distance between each point and each centroid. That is achieved by taking the Cartesian product of the set of all points and the set of all centroids, then calculating the distance between every pair. Now we can form the clusters by using an INNER JOIN to keep only the lowest distance value, aggregated per point.
New Centroids Placement
Now that we have identified the new clusters, we need to calculate the center of mass (average point) of each cluster individually. These new centers of mass become the coordinates of the centroids we place on the next iteration. We repeat this process until we reach the optimal state – the one after which any further iteration will not change the centroids’ coordinates. The loop we need here is a “DO WHILE”, since we need it to exit the repetitions once that condition is fulfilled.
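Putting the pieces together, the whole iteration can be sketched as a single Python loop (a `while True` with a convergence check stands in for the script’s DO WHILE; the function name `kmeans` and its return shape are our own choices):

```python
import math

def kmeans(points, centroids):
    # Repeat: assign each point to its nearest centroid, then move each
    # centroid to its cluster's center of mass. Exit once no centroid
    # moves between iterations -- the DO WHILE condition.
    while True:
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            # Center of mass: average each coordinate over the members.
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]  # leave an empty cluster in place
            for i, members in clusters.items()
        ]
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids
```

Run it on two obvious blobs, e.g. `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0.0, 0.0), (10.0, 10.0)])`, and the centroids settle at the midpoints of each pair after a couple of iterations.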
All of this may sound a bit complicated, and it is to a degree, but there’s no denying the satisfaction you get from watching Qlik just paint on your screen with the colors of victory. Just check out the GIF below and see for yourselves.
As you can see, despite being on the market for a very long time, QlikView is still very capable software. We had loads of fun implementing this and can’t wait to try our hand at another machine learning algorithm. B EYE can help you take greater advantage of your data, so fill in the contact form below and we can get started on improving your business.