Data plays a prominent role in our everyday lives, whether we are aware of it or not. Most people have by now accepted that data is at the heart of most things we do in the digital world. However, few give a second thought to where that data comes from, or what the consequences would be if the data we rely on were incorrect.
When the data we have to work with has not been gathered objectively, the conclusions we draw will similarly not be as objective as we might want them to be. In this way, it is possible to introduce our own human biases into the world of data analytics.
Data Collection Methods
There are numerous techniques businesses today can use to get the data they need. One of the most popular is data scraping. Data scraping is difficult to set up, so it is impractical for someone without prior knowledge or experience. Once set up, however, it is almost entirely automatic. Because of this, many businesses scrape huge amounts of data every day without a human ever reviewing it.
Data can be scraped from a single source or taken from many different sources. Businesses that scrape data from one specific source will likely want to develop a tool designed for that source. Those looking to access data from a wider range of sources, on the other hand, will want a more versatile scraping tool.
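To make the single-source case concrete, here is a minimal sketch of a purpose-built scraper using only Python's standard library. The page markup, the class names "product", "name" and "price", and the helper scrape_products are all hypothetical, invented for illustration. A real tool would fetch live pages and would need to cope with markup that changes over time.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a page from one specific source.
SAMPLE_PAGE = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects (name, price) pairs from spans with class 'name' or 'price'."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None   # which field the parser is currently inside, if any
        self._current = {}   # fields gathered for the current list item

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            css_class = dict(attrs).get("class")
            if css_class in ("name", "price"):
                self._field = css_class

    def handle_data(self, data):
        # Text is only recorded while inside a span we care about.
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            name = self._current.get("name")
            price = float(self._current.get("price", "nan"))
            self.products.append((name, price))
            self._current = {}


def scrape_products(html):
    """Return a list of (name, price) tuples found in the given HTML string."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


print(scrape_products(SAMPLE_PAGE))
```

Notice how tightly the parser is bound to one page layout. That is what makes a single-source tool simple to write, and also why it silently returns nothing useful when pointed at a different site. Nobody reviewing the output would necessarily notice what was missed.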
Online tracking is another popular method of gathering data, again because it is mostly automated. If your business has a website or a mobile app, this provides you with an easy way of gathering data about individual customers.
Most people don’t think twice about the data that the apps on their phones share with developers, so they probably aren’t aware that you can collect data from them, let alone that you actually are. When people are aware that data is being gathered, their behavior often changes accordingly. Online tracking is a passive method of data collection that lets you gather data from your users unobtrusively.
Some online services try to keep their data off-limits to unauthorized access. Others are much more open: businesses that are willing to share their data will often provide APIs that other businesses can use to interface with their systems. APIs are what enable so many companies to gather so much data from social media platforms. Without APIs to facilitate things, collecting this data at scale would be significantly harder.
Mask of Objectivity
We like to think of data as being something cold and objective, much like computers. However, things are nowhere near this simple. For one thing, any data that we use first has to be gathered, and the methods we use to gather it can affect its accuracy. We are all familiar with the concept of a leading question, where the phrasing invites a particular answer. Although this is a clear and obvious example of bias, most cases are nowhere near as clear-cut.
When it comes to data collection, many of the biases that have the greatest effect on us are entirely subconscious, and we are completely unaware of them. Other biases in our data reflect assumptions that are prevalent throughout society. For example, of the roughly seven billion people on earth, about half are women. And yet, far fewer than half of all senior positions such as heads of state, heads of government, and heads of corporations are held by women.
The gender disparities we see across industries seem to be massively exaggerated in the tech sector. As a result, we have a tech industry that is largely focused on men at the expense of women. Nowhere is this more evident than in big data, where the focus on men has limited the ability of big data techniques to help us address issues specifically affecting women.
Bias in Big Data
The way we define what is normal dictates our baseline for any laboratory experiments we conduct. Whether you’re conducting medical research or data analytics research, if your benchmark for what constitutes normal is off, then all the conclusions you draw will similarly be inaccurate.
Many within and outside the data analytics industry are under the impression that it all comes down to their statistical analyses. In other words, as long as the majority of data is right, the conclusions will similarly be right.
However, we know this to be untrue. In fact, the way we gather data has a huge impact on its reliability. If biases have come into play during the collection of data, then any analysis of that data will be similarly tainted.
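A small simulation can illustrate why more data and better statistics do not repair a biased collection process. Everything in this sketch is an assumption made for illustration: two equally sized groups, the made-up averages of 60 and 70, and a collection method that draws 90% of its records from one group. It is not based on any real dataset.

```python
import random
from statistics import mean


def build_population(rng, size=20_000):
    """Two equally sized hypothetical groups whose values differ on average."""
    group_a = [rng.gauss(60, 5) for _ in range(size // 2)]
    group_b = [rng.gauss(70, 5) for _ in range(size // 2)]
    return group_a, group_b


def collect(rng, group_a, group_b, n, share_a):
    """Draw n records, taking a fraction share_a of them from group A."""
    n_a = round(n * share_a)
    return rng.sample(group_a, n_a) + rng.sample(group_b, n - n_a)


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
group_a, group_b = build_population(rng)
true_mean = mean(group_a + group_b)

results = {}
for n in (100, 1_000, 10_000):
    # A fair collection takes half its records from each group.
    fair = mean(collect(rng, group_a, group_b, n, share_a=0.5))
    # A skewed collection over-represents group A.
    skewed = mean(collect(rng, group_a, group_b, n, share_a=0.9))
    results[n] = (fair, skewed)
    print(f"n={n:>6}  fair={fair:.2f}  skewed={skewed:.2f}  true={true_mean:.2f}")
```

As the sample grows, the fair estimate settles close to the true average, while the skewed estimate settles just as confidently on the wrong number. A larger sample only makes the biased answer look more precise.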
Data analytics is reliant upon our ability to source data responsibly. It’s no good assuming that the maths will be able to account for everything – it can’t account for what we don’t know is there. Data analysts need to consider the source of their data in much more detail than they currently do.