When creating dashboards, there are many ways to import the data you need. Usually, the company that commissions the dashboard owns and provides the data. Sometimes, however, that’s not the case, and you must turn to public sources to collect the necessary data. These can range from social media sites to public registries, but in every case the data is openly available to everyone. Working with Open or Public Data can be challenging and time-consuming, so we decided to share our experience on the matter and show you why integrating it into your dashboard can be a great idea.
Do successful companies use Open Data and Public Data?
First things first: what’s the difference between Open and Public Data? The former is information that’s been published on government-sanctioned portals, while the latter is the information that exists everywhere else. If you doubt that these two can increase the return on your analytics investment, just check out the following examples. Amazon analyzes the preferences of its customers along with the campaigns of its competitors to better position itself for sales and promotions. Major automobile manufacturers, on the other hand, use social network feedback to detect manufacturing imperfections spotted by their customers. Open or public data can be beneficial to all business spheres, including healthcare. For example, Clear Health Costs puts together data on health care prices from government sources, surveys, and research to provide clear and accurate information to its users. The list can go on, but by now, we bet you’re asking yourselves, “How can we take advantage of this?” Well, stick around to find out.
What are the use cases?
Your organization probably gets lots of customer feedback on social media sites, but do you take advantage of it? Depending on your use case, you can turn to different sources of open or public data, so for starters, let’s list some of the most commonly used ones. Trading Economics, The World Bank, and Eurostat are all great sources of economic data, market data, macro data, and more. These are most often used in financial, accounting, and marketing analytics tools. Twitter and Facebook are commonly used as data sources in marketing analytics dashboards. The National Institutes of Health holds an array of valuable sources for analytics in the Pharma and Medical Devices industries. In a world where data grows exponentially, a big part of it will inevitably end up in the public domain. This data can give you a better perspective on the current state of things: who the key people in your sphere of interest are, which companies are leading the market there, how your competitors are doing, what the current prices of various assets or items are, and so on.
Why is Qlik preferable?
At B EYE, we take advantage of Qlik and its powerful data extraction and transformation features. Qlik’s scripting is also very flexible, which allows for some creative coding approaches. Combine that with Qlik’s user-friendly interface and you can probably see why we’ve chosen this platform. Additionally, we can use Qlik for the whole project: extraction, data cleansing, transformation, linking to internal data sources, modeling, and the front end. No need for extra licenses. No need for additional tools.
Our preferred method of integrating data from public sources is to use the source’s own APIs. Most registries, websites, and services have developed the necessary functionality because the data is supposed to be public and used by everyone. Qlik is capable of transforming JSON, XML, and other data formats directly into a table format. Qlik’s built-in REST connector makes acquiring structured and semi-structured data easy and intuitive. When an Open or Public Data source does not have an API or a REST endpoint, we can always use Qlik’s HTML text analysis capabilities. These require a higher level of coding know-how to get the relevant data, but the approach still works wonders, since Qlik can analyze the body of a web page and pick out the valuable data from it. Something to keep in mind, however, is that some public registries have checks in place to boot out anyone who is overloading their servers. This can be quite a nuisance for a developer, since the timeout lasts anywhere from a couple of seconds to several minutes. We can account for it by writing the script so that it watches for a lost connection to the server, waits, and then retries.
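The connection-loss handling described above could look roughly like the following. This is a minimal Python sketch rather than Qlik script, meant only to illustrate the logic. The name `fetch_with_retry`, its parameters, and the doubling delay are our own assumptions, not part of any Qlik connector or registry API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); if the connection drops, wait and try again.

    The wait doubles after every failed attempt (2s, 4s, 8s, ...), which
    gives a rate-limiting server time to let us back in.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Here `fetch` stands for whatever function performs a single request against the public source. Passing `sleep` as a parameter is a design choice that makes the waiting behavior easy to test without actually pausing.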
So now that we have our data extracted, how should we proceed with its maintenance? Well, we can always just download and store the data once. However, in most cases, we need an always up-to-date database, so we will need to set up a reload schedule. Designing some incremental extraction logic is also preferable, so that we don’t download the same set of data every time but instead extract only the new, changed, or deleted data. Qlik allows for easy setup of such incremental logic, again with no other tools or frameworks needed.
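As an illustration of the incremental idea, here is a small Python sketch (not Qlik script). It assumes, hypothetically, that the source can return the records modified since a given timestamp plus the list of IDs that currently exist; the field names `id` and `modified` and both function names are made up for the example.

```python
from typing import Dict, List, Set

Record = Dict[str, str]


def apply_increment(
    stored: Dict[str, Record],
    changed: List[Record],
    current_ids: Set[str],
    key: str = "id",
) -> Dict[str, Record]:
    """Merge one incremental extract into the previously stored data."""
    merged = dict(stored)
    for record in changed:
        merged[record[key]] = record  # insert new rows, overwrite changed ones
    # Rows whose key no longer exists at the source were deleted there.
    return {k: v for k, v in merged.items() if k in current_ids}


def next_watermark(previous: str, changed: List[Record], field: str = "modified") -> str:
    """Return the latest modification stamp seen so far (ISO date strings)."""
    return max([previous] + [record[field] for record in changed])
```

On the next scheduled reload, only records modified after the returned watermark would need to be requested.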
Just obtaining public data is not going to be enough most of the time. You will need to combine it with your internal data to take full advantage of it. Keep in mind that public data won’t always come as structured and organized as you’d want it. In most cases, you’ll encounter issues such as the name of a person or product being spelled in various ways, or a seemingly similar attribute like Region having different granularities. In such cases, we usually rely on algorithms to help us organize the data. Examples include the Longest Common Subsequence (LCS) and Hierarchical Clustering. The beauty of Qlik is that, as mentioned earlier, you can write such algorithms without the need for further tools.
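To give a feel for how LCS helps with differently spelled names, here is a minimal Python sketch (for illustration only, not Qlik script). The similarity score, twice the LCS length divided by the combined length of both strings, is one common way to turn LCS into a 0-to-1 measure; it is our assumption for the example, not necessarily the scoring used in any particular project.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)  # results for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 (nothing shared) and 1.0."""
    a, b = a.casefold(), b.casefold()
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
print(lcs_similarity("B. Obama", "Barack Obama"))  # 0.7
```

In practice you would pick a threshold above which two values are treated as candidates for the same entity, and review the borderline cases by hand.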
How to structure an Open and Public Data project?
The successful integration of data from various public sources into your own analytics environment depends on numerous factors, so there’s no simple formula to follow each time. Adaptability is therefore a top skill to have when dealing with such a project. Below are some overall guidelines on how to begin your Open Data endeavor.
1. Identify the right source – with so much data available on the web, it is crucial that you find the right sources to pull data from for your needs. Think about the long term – will that website still be around in 2, 3, or 5 years? Sources like Google Scholar or the National Institutes of Health, with their vast assortment of links to open data sets, seem to be more reliable than that trending GitHub data repository on the COVID-19 pandemic.
2. Segment your data – now that you’ve found the right sources, you will need to profile the right data. Data profiling is a yo-yo process – you start with a generic definition and then you continue adding and removing filters until you end up with the data that matches the needs of your organization.
3. Add the data to your dashboards – combining Open or Public Data with the data generated by your organization is the most challenging step. For example, how does a machine know that Barack Obama, B. Obama, and Barack Hußein Obama are all the same person? You may have to create a set of normalization functions that reduce different spellings to a common form. In other words, functions that make sure the symbols are uniform and that we’re linking apples to apples. The best part is that Qlik allows for writing such functions – no additional tools needed.
4. Consider scalability – the volume of Open or Public Data can grow exponentially; social network feeds are a typical example. You need to measure how fast the data volume increases per day, week, or month, and then calculate where it is going to be in one year. In general, you should always keep scalability in mind when working on such a project.
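Steps 3 and 4 above can be sketched in a few lines of Python (again for illustration only, not Qlik script). All function names are hypothetical. The matching key used here, first initial plus last name, is a deliberately crude assumption that will merge different people who share those two attributes, so treat it as a starting point rather than a finished rule. The projection assumes a constant growth rate per period.

```python
import re
import unicodedata
from typing import Tuple


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text.casefold())  # casefold maps ß to ss
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())


def person_key(name: str) -> Tuple[str, str]:
    """Crude matching key: (first initial, last name)."""
    parts = normalize(name).split()
    if not parts:
        return ("", "")
    return (parts[0][0], parts[-1])


def project_volume(current: float, growth_per_period: float, periods: int) -> float:
    """Compound growth: volume after `periods` at a constant growth rate."""
    return current * (1 + growth_per_period) ** periods


for name in ("Barack Obama", "B. Obama", "Barack Hußein Obama"):
    print(person_key(name))  # ('b', 'obama') each time
```

For the scalability estimate, a call such as `project_volume(500.0, 0.08, 12)` would project twelve months ahead from a measured 8% monthly increase on 500 units of volume; both numbers are invented for the example.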
Considering all the bumps that can occur along the way, we hope you can see why Qlik seems preferable for such projects, and potentially for more. Regardless of whether you’ve used open or public sources before, we trust this article has shown you the numerous benefits this approach can bring to your business. So, do you already have a use case for your company in mind?