Forecasting the COVID-19 Crisis with Density Clustering

By Stiliyan Neychev

By Dimitar Dekov

December 15, 2023

During these challenging times, it’s comforting to know that there’s so much useful data, which can assist with the important decisions every country or organization needs to make. A large amount of that data is publicly available, which means that all you need is the right tools and mindset to start using it in beneficial ways. That’s why, to assist our clients better, we developed an app that collects all the COVID data and visualizes it in multiple reports. If you’re curious to know what exactly we’ve bundled up in there, then read on while we continue to tell you …

What’s inside our COVID dashboard?

To provide a wide range of information from various angles, we’ve grouped it into several different screens. It all starts with the typical summary sheet, which mainly shows the daily COVID data around the globe: daily infected, cured, and deceased. Following that, we’ve got a sheet dedicated to an economic impact projection, where we can observe several scenarios of COVID progression and the mark they’ll leave on the economy.

As you’d expect, we’ve got a sheet that provides info on the measures each country has taken to stop the spread of the virus, and another sheet that shows the daily number of COVID tests performed worldwide along with the positivity rate. We’ve also used Google’s community mobility data to create a sheet that shows the change in visits to workplaces, grocery shops, parks, etc. It’s interesting to see how the epidemic has changed our everyday life, and how strictly the measures are enforced in most countries while others take a more relaxed approach.

All of the visualizations so far have utilized the publicly available data to provide a worldwide view of the epidemic. We also have a few sheets focused on the situation in the USA. You can even drill down to the county, city, or hospital level with details about free ICU beds, patients on invasive ventilation, daily rates of new COVID patients hospitalized, patients being dismissed, and many more key indicators. This transitions well into one of our advanced sheets. Why is it advanced, you may ask? It’s set up to assess the risk level in every state in the US with the help of a density clustering algorithm. With all that said and done, let’s take a closer look at this advanced sheet and how it works.

Retail and Recreation location visits

How visits to retail and recreation locations have changed.
The red colors represent a decrease, while the blue/green shows an increase in activity.

Hospital bed utilization in LA County, California

The size of the bubble represents the total beds available in a hospital.
The bubble color depends on the bed utilization rate – the red color is high, blue is low.

What’s Density Clustering?

The title says it all, really – it’s an algorithm that finds high-density clusters within your data set and considers the rest as noise. We’ve taken advantage of one of the most popular clustering methods – hierarchical density-based spatial clustering of applications with noise, or HDBSCAN for short. So, without going into too many details, let’s quickly explain how this works.

The process of HDBSCAN can be summarized in three simple steps – estimate the density around each data point, pick the areas with high density, and group together the points in the selected regions. To assess the density around a data point, we check how far away its nearest neighbors are, for example, the distance to its 7th nearest neighbor – if that distance is large, the density is low, and vice versa. This distance is called the “core distance”, and it’s what makes the method density based. Once we have the dense parts figured out, we can easily follow up with the other two steps.
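To make the core distance more concrete, here is a minimal Python sketch of that first step only. The points are made up for illustration, the helper name core_distance is our own, and we use 2 neighbors instead of 7 because the toy data set is tiny. It is not the full HDBSCAN algorithm, just the density estimate it starts from.

```python
import math

def core_distance(points, index, k):
    """Distance from points[index] to its k-th nearest neighbor."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]

# Hypothetical 2-D data: a tight group of four points plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

k = 2  # the article's example uses 7; a tiny data set needs a smaller value
for i, point in enumerate(points):
    # A small core distance means a dense neighborhood, a large one means sparse.
    print(point, round(core_distance(points, i, k), 2))
```

The four grouped points get a small core distance, while the far-away point gets a large one, which is what marks it as sitting in a low-density area.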

Notice, though, that we chose to seek out the 7 nearest neighbors in our example, but that number can be whatever we wish. It can even be a complex formula. This adjustment potential lies at the core of the HDBSCAN algorithm, since you can fine-tune the densities you’re looking for and rule out everything else as noise. Of course, this also means that it’s harder to get everything just right, but let’s show you what we managed to achieve with it.
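The effect of changing the number of neighbors can be illustrated with a small Python sketch. Note the assumptions: the data is invented, and the noise rule used here (a fixed cut-off on the core distance, called max_core_distance) is a deliberate simplification of our own. Real HDBSCAN builds a hierarchy of clusters instead of applying a single threshold.

```python
import math

def core_distances(points, k):
    """Core distance (distance to the k-th nearest neighbor) for every point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def noise_points(points, k, max_core_distance):
    """Simplified rule (an assumption): a point is noise if its core distance is too large."""
    return [
        p for p, d in zip(points, core_distances(points, k))
        if d > max_core_distance
    ]

# Hypothetical data: a group of five points and an isolated pair.
group = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
pair = [(8, 8), (8, 9)]
points = group + pair

for k in (1, 3):
    print("k =", k, "noise:", noise_points(points, k, max_core_distance=2.0))
```

With k = 1 the isolated pair still counts as dense, because each point has one close neighbor. With k = 3 both points of the pair are treated as noise, since their third nearest neighbor is far away.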

How does the algorithm help us?

As mentioned previously, one of our advanced sheets is set up to determine the risk level of all US states. With it, we’re aiming to predict where this COVID crisis is heading, so we can help the healthcare system keep it under control and avoid getting overloaded. To define the risk factor, we use two components: population risk, which looks at people over the age of 65, people who have heart problems, diabetes, etc.; and epidemic risk, which observes the speed of infection, death rate, etc.

All this will be much clearer with an example, so let’s check out which states have more people in the risk groups. In this case, we’ll view the number of people with heart disease along the Y-axis and those over 65 years old along the X-axis. By inspecting the chart, we can see that the closer a state or a cluster is to the top right corner, the more its population falls into the risk group. Now let’s do the same for the epidemic risk.

Example 1 – Clustering of states based on population with heart disease and people over 65

For this example, we’ll observe the daily infections per available hospital bed compared to the death rate. Grim stuff, but it’ll always be that way when dealing with an epidemic. Alright, now we have four KPIs in total – two from the first chart and two from the second. However, if we try to plot them all at once, it’ll be hard for anyone to discern anything from the result, since it would have four dimensions.

Example 2 – Clustering of states based on the daily infections vs hospital capacity and COVID-19 death rate

Since being closer to the top right corner means things are worse in both charts, we can simply add together the two indicators of each chart (X-axis + Y-axis) and place the two sums in a new visualization. This is done only to present the states in a way our users can understand. The algorithm has already formed its clusters based on all 4 indicators. It can easily work in 4 dimensions or more, but the human brain cannot visualize a 4-dimensional scatterplot, so we apply the trick of adding the indicators.
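The adding trick can be sketched in a few lines of Python. The state names and numbers below are invented, and the min-max rescaling step is our own assumption, added so that indicators measured in different units can be summed; the article does not say how the dashboard scales them.

```python
def min_max_scale(values):
    """Rescale values to the 0-1 range so indicators are comparable (an assumption)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical indicator values for three made-up states (not real data).
states = ["State A", "State B", "State C"]
heart_disease_pct = [4.0, 6.0, 8.0]
over_65_pct = [12.0, 20.0, 16.0]
infections_per_bed = [0.5, 1.5, 1.0]
death_rate_pct = [1.0, 3.0, 2.0]

heart, older, infections, deaths = (
    min_max_scale(column)
    for column in (heart_disease_pct, over_65_pct, infections_per_bed, death_rate_pct)
)

# One combined score per chart: the two indicators of that chart added together.
population_risk = [h + o for h, o in zip(heart, older)]
epidemic_risk = [i + d for i, d in zip(infections, deaths)]

for name, pop, epi in zip(states, population_risk, epidemic_risk):
    print(f"{name}: population risk {pop:.2f}, epidemic risk {epi:.2f}")
```

Each state ends up with just two numbers, which is enough to place it on an ordinary scatterplot, while the clustering itself can still run on all four indicators.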

Example 3 – Clustering of states based on all 4 indicators: population with heart disease; people over 65; daily infections vs hospital capacity; COVID-19 death rate

Continuing this example, let’s pick the cluster composed of Wisconsin and Minnesota, since it’s time to talk about our other advanced sheet – predictions by state. Here we use a time series forecasting algorithm. We won’t go into detail about how it works, other than to say that it combines various statistical approaches. In the image below, the blue line represents the daily infection rate; based on it, we get a rough estimate of this cluster’s state in the following months, shown by the yellow dotted line.
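For readers who want a feel for what a trend-based forecast looks like, here is a small Python sketch. It uses Holt’s linear trend method (double exponential smoothing) purely as a stand-in; the dashboard’s actual forecasting algorithm is not named in this post. The case counts and the smoothing parameters alpha and beta are made up.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear trend method: smooth a level and a trend, then extrapolate."""
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Forecast h steps ahead by extending the last level along the last trend.
    return [level + (h + 1) * trend for h in range(horizon)]

# Hypothetical daily new cases for one cluster of states (not real data).
daily_cases = [820, 860, 910, 990, 1050, 1140, 1200]

forecast = holt_forecast(daily_cases, alpha=0.5, beta=0.5, horizon=7)
print([round(value) for value in forecast])
```

A production forecast would normally also account for weekly seasonality and report uncertainty bands, which this sketch leaves out.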

Furthermore, we’ve set up the algorithm to provide another prediction, represented by the red dotted line. It shows how many people are being treated in intensive care units (ICUs). This prediction takes into consideration the recent trend of people going into intensive care as well as the predicted daily infections. As you can see, Wisconsin and Minnesota will run out of ICU beds roughly by the end of the year if nothing changes. So, how can we utilize this information?

Example 4 – Predictions for New Daily Cases and People in ICU Beds
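Once a forecast of ICU patients is available, finding the moment capacity runs out is a simple comparison. The Python sketch below assumes an invented forecast and an invented bed count; the helper name first_day_over_capacity is ours and does not come from the dashboard.

```python
def first_day_over_capacity(forecast, capacity):
    """Return the 1-based forecast day on which demand first exceeds capacity, or None."""
    for day, patients in enumerate(forecast, start=1):
        if patients > capacity:
            return day
    return None

# Hypothetical forecast of ICU patients for the coming days (not real data).
icu_forecast = [410, 425, 441, 458, 476, 495, 515]
total_icu_beds = 480  # hypothetical capacity

day = first_day_over_capacity(icu_forecast, total_icu_beds)
if day is None:
    print("Capacity is not exceeded within the forecast horizon.")
else:
    print(f"Forecast exceeds ICU capacity on day {day}.")
```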

Insight turned into action

Let’s dive into our example’s insights, shall we? From the looks of things, we could safely conclude that if this cluster of states finds a way to increase the total number of ICU beds and staff them properly, then that will help them prevent a dangerous situation or a worst-case scenario. Another thing that could possibly help would be to increase or better enforce the measures those states have implemented against COVID. In any case, the stats show what would happen if nothing changes at all, so, at the very least, this could serve as a wake-up call to jump to action.

With this information at hand, the hospitals in the two states can pay careful attention to the situation and act accordingly, as well. Let’s say that a hospital in Milwaukee County, Wisconsin, notices that at the current rate of accepting patients, it will be full by the end of the month. In that case, with the help of this dashboard, it can locate potential nearby hospitals with some capacity and direct new patients there. It’s important to know if this transfer of patients between cities, counties, or even states is possible because, after all, COVID isn’t the only illness they’re dealing with. Furthermore, with an overbooked medical center, it’s harder to make an appointment for check-ups on other diseases like cancer or diabetes.
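The hospital lookup described above could be sketched like this in Python. Everything in the example is hypothetical: the hospital names, coordinates, bed counts, and the field names are invented for illustration, and distance is measured as the crow flies using the haversine formula rather than actual travel time.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometers."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearby_with_capacity(origin, hospitals, max_km, min_free_beds):
    """Hospitals within max_km that have at least min_free_beds free, nearest first."""
    candidates = []
    for h in hospitals:
        distance = haversine_km(origin["lat"], origin["lon"], h["lat"], h["lon"])
        free_beds = h["total_beds"] - h["occupied_beds"]
        if distance <= max_km and free_beds >= min_free_beds:
            candidates.append((round(distance, 1), h["name"], free_beds))
    return sorted(candidates)

# Hypothetical hospitals and coordinates (not real facilities or real data).
origin = {"name": "Hospital A", "lat": 43.04, "lon": -87.91}
hospitals = [
    {"name": "Hospital B", "lat": 43.06, "lon": -88.05, "total_beds": 300, "occupied_beds": 240},
    {"name": "Hospital C", "lat": 43.07, "lon": -89.40, "total_beds": 500, "occupied_beds": 350},
    {"name": "Hospital D", "lat": 42.99, "lon": -87.95, "total_beds": 200, "occupied_beds": 200},
]

for distance, name, free_beds in nearby_with_capacity(origin, hospitals, max_km=50, min_free_beds=5):
    print(f"{name}: {distance} km away, {free_beds} free beds")
```

In this made-up setup only Hospital B qualifies: Hospital C has free beds but is too far away, and Hospital D is close but full.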

And last but not least, we believe that companies from the pharmaceutical and medical devices industries need to pay careful attention to all of this as well. This epidemic is already the cause of some response and delivery delays. Furthermore, plenty of life sciences companies create and distribute products with expiration periods. The insights generated from these advanced algorithms could be the difference between a fully utilized batch and one discarded due to product expiration. The key here is to work alongside hospitals to find solutions to these emerging issues; life sciences businesses that avoid doing so could take quite a hit.

Dedicating the necessary time, doing the research, and allowing algorithms to at least point you in the right direction is required in these tough times. Healthcare institutions need to be constantly aware of everything in order to be always prepared for the worst. The examples we went through are just a small taste of the foreknowledge you could end up with if the data surrounding us is adequately utilized.


We hope you found this sneak peek into our COVID dashboard insightful. We encourage everyone working in the healthcare and life sciences fields to at least start considering the data at hand, because as you saw, just observing a small portion of the problem started pointing us to worthwhile solutions. Furthermore, this data can assist not only the health care system and the associated businesses but policymakers as well. A failure to plan is a plan to fail, after all. So, how do you plan to deal with this uncertain future?