• Big Data
  • Federica Pelzel
  • SEP 15, 2017

Big Data Will Be Biased, If We Let It


If I had a penny for every time I’ve heard “data doesn’t lie”…

For those of us who have the ever-exciting, ever-growing task of working with Big Data to help solve some of our organizations’ biggest inefficiencies, questions, or problems, perpetuating bias is an all-too-easy mistake to make, and we should all be familiar with it by now.

For everyone else, here’s what’s going on:

Quick recap for those who feel lost when data jargon gets thrown at you


(If this isn’t the information you’re looking for, move along to the next section)

  • Big Data is, well, a lot of data: quantitative and qualitative indicators that, in large enough amounts, get used to identify patterns, trends, and relationships.
  • Algorithms are ‘a process or set of rules to be followed in calculations or other problem-solving operations’. For example, if I’m deciding what to wear in the morning, I’m mentally using an algorithm that considers the weather, my mood, where I’m going, and that Ben & Jerry’s I shouldn’t have eaten last night, and that leads me to pick my outfit.
  • Machine Learning ‘provides systems the ability to automatically learn and improve from experience without being explicitly programmed’. If I were a machine learning algorithm, I wouldn’t have had that Ben & Jerry’s last night, because I would’ve learned from the last time that I’d live to regret it.
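
For readers who like to see things spelled out, here is a minimal Python sketch of the difference between the last two bullets. Everything in it is made up for illustration: `pick_outfit` is a hand-written algorithm whose rules never change, while the hypothetical `SnackLearner` adjusts its answer based on recorded experience. Real machine learning is far more involved; this only shows the idea.

```python
from collections import defaultdict


def pick_outfit(temperature_c: float, raining: bool, formal_event: bool) -> str:
    """A fixed rule set: the same inputs always produce the same outfit."""
    if formal_event:
        outfit = "suit"
    elif temperature_c < 10:
        outfit = "sweater and jeans"
    else:
        outfit = "t-shirt and shorts"
    if raining:
        outfit += " with a raincoat"
    return outfit


class SnackLearner:
    """A toy 'learner': its advice changes as it records more experience."""

    def __init__(self) -> None:
        # snack name -> list of booleans (True means "I regretted it")
        self.history = defaultdict(list)

    def record(self, snack: str, regretted: bool) -> None:
        self.history[snack].append(regretted)

    def should_eat(self, snack: str) -> bool:
        past = self.history.get(snack, [])
        if not past:
            return True  # no experience yet, so nothing to learn from
        regret_rate = sum(past) / len(past)
        return regret_rate < 0.5  # hypothetical cut-off


learner = SnackLearner()
before = learner.should_eat("ice cream")   # no history yet
learner.record("ice cream", regretted=True)
after = learner.should_eat("ice cream")    # one regretted experience on record
print(pick_outfit(temperature_c=8, raining=True, formal_event=False))
print(before, after)
```

The point to notice is that nobody rewrote `should_eat` between the two calls; only the recorded history changed.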

How bias has been introduced into our ‘smart’ world


My first encounter with the concept of data-driven bias blew my mind, and made me wonder how I hadn’t seen it before. It was ProPublica’s essay titled Machine Bias. Right after the title, it stated:

There’s software used across the country to predict future criminals. And it’s biased against blacks.

The tl;dr story here is that several states in the US implemented an algorithm to predict the risk of defendants reoffending, and use this score as a factor during sentencing. Interestingly enough, race and ethnicity are reportedly not variables in this algorithm, but it somehow fails black defendants the most. Only 20% of defendants who were identified as being at high risk of committing a violent crime in the future actually did; and it mislabeled black defendants as high risk at almost twice the rate of white defendants.

The whole essay and their analysis of the data are definitely worth a read, but most importantly, this isn’t the only place where algorithms are failing minorities and other groups.
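
One way to read “mislabeled” is as a false positive rate: out of the people who did not go on to reoffend, what share had been flagged as high risk? The sketch below shows that arithmetic per group. The records are hypothetical toy data invented for this example, not ProPublica’s dataset, and the function name and tuple layout are my own assumptions.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, flagged_high_risk, reoffended) tuples.

    Returns, per group, the share of non-reoffenders who were flagged high risk.
    """
    false_positives = defaultdict(int)
    did_not_reoffend = defaultdict(int)
    for group, flagged, reoffended in records:
        if not reoffended:
            did_not_reoffend[group] += 1
            if flagged:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in did_not_reoffend.items()}


# Hypothetical toy data: 15 people per group, 10 of whom did not reoffend.
records = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6 + [("A", True, True)] * 5
    + [("B", True, False)] * 2 + [("B", False, False)] * 8 + [("B", True, True)] * 5
)

rates = false_positive_rate_by_group(records)
print(rates)
```

In this toy data, 4 of 10 non-reoffenders in group A were flagged versus 2 of 10 in group B, so group A is wrongly flagged at twice the rate even though both groups reoffend equally often.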

Credit: Ford Foundation

It’s important to note, before we go any further down this rabbit hole, that I don’t think the people building these tools intend to deliberately discriminate or create any sort of bias. We could argue the exact opposite: tools like these are meant to limit the individual bias of whoever is conducting a risk assessment by providing ‘undeniable’, quantifiable, trustworthy data.

For a long time now my motto has been “If nothing changes, nothing changes”, and it rings most true in this case. Bias is nothing new, and it requires specific action to be overcome, in and out of the data science world. By feeding our algorithms history through data, we’re implicitly telling them to discriminate against everyone who’s been historically discriminated against.
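
To make that concrete, here is a deliberately tiny sketch of how a model can reproduce historical discrimination without ever seeing a protected attribute. All of the data, the group labels, the neighborhoods, and the 0.5 threshold are hypothetical. The “training” step just memorizes past approval rates per neighborhood, which stands in for what a real model would pick up from a correlated variable.

```python
from collections import defaultdict


def learn_approval_rates(history):
    """'Train' by memorizing the historical approval rate of each neighborhood."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for row in history:
        total[row["neighborhood"]] += 1
        approved[row["neighborhood"]] += row["approved"]
    return {n: approved[n] / total[n] for n in total}


def predict(model, applicant, threshold=0.5):
    """Approve if the applicant's neighborhood was usually approved in the past.

    Note that the applicant's group is never consulted here.
    """
    return model[applicant["neighborhood"]] >= threshold


# Hypothetical history in which group Y, living mostly in "south",
# was approved far less often than group X, living mostly in "north".
history = (
    [{"group": "X", "neighborhood": "north", "approved": 1}] * 8
    + [{"group": "X", "neighborhood": "north", "approved": 0}] * 2
    + [{"group": "Y", "neighborhood": "south", "approved": 1}] * 3
    + [{"group": "Y", "neighborhood": "south", "approved": 0}] * 7
)

model = learn_approval_rates(history)
north_decision = predict(model, {"neighborhood": "north"})
south_decision = predict(model, {"neighborhood": "south"})
print(north_decision, south_decision)
```

Because neighborhood and group overlap almost completely in this made-up history, dropping the group column changes nothing: yesterday’s pattern becomes tomorrow’s rule.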

Some of these examples are embedded into our culture and we accept them as the norm, although not always happily: your gender, income, level of education, and other factors determine how much you will pay for healthcare. Some health-related indicators do as well, for example whether you smoke or not. It’s not unusual, however, for a healthy, non-smoking woman to pay more for insurance than a sedentary man who smokes two packs a day, even though many doctors would agree that, based on that data, the latter is more likely to fall ill.

On the opposite side of gender bias: men do pay more for car insurance, for similar reasons. ‘For instance, an 18-year-old male living in Nevada would pay an average of $6,268 a year to insure his sedan if he had the misfortune to grow up there. That’s 51 percent higher than what his twin sister would pay (assuming they have the same grades and driving records), who would fork out just $4,152 to insure an identical car, according to a CoverHound analysis.’ And since we’re on the car insurance subject, minorities pay more for car insurance than white people in similarly risky neighborhoods.

This matters now more than ever


Algorithms of this kind have existed for decades, and a lot of the time they are used by organizations to scale their operations through repeatable patterns that can be applied to everyone.

The reason why we need to look at this now more than ever is that, through a growing and thriving tech industry, these models are being applied pretty much everywhere: from court sentencing, job searching, credit card, college, and mortgage applications, and consumer goods, to AI speaking bots, evaluating teachers’ performance, targeted social media ads, and more.

This means that, whether you’re interested in big data, algorithms, and tech, or not, you’re a part of this today, and it will affect you more and more.

If we don’t put in place reliable, actionable, and accessible solutions to approach bias in data science, this type of usually unintentional discrimination will become more and more normal, at odds with a society and institutions that, on the human side, are trying their best to evolve past bias and move forward in history as a global community.

What’s being done today


The solution to this issue isn’t to stop innovation around big data algorithms and machine learning. Luckily, progress is being made on several fronts.

The algorithm heroes

Organizations like AlgorithmWatch and the Algorithmic Justice League (AJL), founded by Joy Buolamwini (her TED talk is amazing), are striving to help evaluate and identify bias in existing algorithms by providing education and training materials, as well as a collaborative and inclusive space for people to report bias in algorithms and help solve these issues as a community.

There are many other individuals, researchers, and organizations working on different ways to approach the situation.

Policy changes with GDPR in Europe

Unfortunately, organizations like AJL aren’t enough to guarantee the necessary change. Change needs backing policy. In Europe, the GDPR (the General Data Protection Regulation, which goes into effect in the EU in May 2018) is going to regulate three key factors involving data bias.

First, profiling, which they define as

Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular, to analyze or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behavior, location or movements.

This comes together with offering a clear explanation to consumers on how their data will be used, and providing them with the option to opt out.

Secondly, the right to an explanation. When companies use automated decision-making, users will have a right to ask for an explanation and to dispute decisions if they were made exclusively by algorithms and data. The scope of this action hasn’t been fully defined, but it’s expected to apply to credit applications, job searching, and other areas of concern.

Last but definitely not least, there’s a specific bias and discrimination section, preventing organizations from using data that might promote bias, such as race, gender, religious or political beliefs, health status, and more, to make automated decisions (with some verified exceptions).

What needs to happen next


Unlike with human bias, we can quickly teach algorithms to consider and avoid bias by including it as another indicator. We can also put policy in place to prevent data-driven bias from happening. In my opinion, there are three main areas we need to work on in the near future to make sure bias is diminished in the data space.


Potentially the most important aspect, and the most accessible one in the short term, is promoting and requiring training and education for people participating in the creation and maintenance of automated decision-making tools, and other data-driven tools prone to bias.

In the tech industry we’ve seen a lot of controversy over bias, and have fought it by adding education and training at the HR level, trying to spread the word on the value of diversity and equality at the personal level. It’s time to make that training broader, and teach everyone involved about the ways their decisions while building tools may affect minorities, and accompany that with the relevant technical knowledge to prevent it from happening.

People who aren’t part of the tech industry should also be aware of this, enough to be able to identify when they might be victims and speak up. Without individuals sharing their stories, and how these methods have changed their lives, the message becomes cold and impersonal, which is exactly what we’re trying to avoid.


I like GDPR’s case in particular (even though it remains to be seen how well it’s implemented), because it comes from a mandate. This is not a suggestion or an option; it needs to happen. The EU Parliament has determined that data security is more and more pertinent to all its citizens, and has identified that it can also be unfair to some of them. There’s incredible value and validation in this.

This kind of data regulation, especially around bias and discrimination, is in my opinion key for the healthy growth of the big data industry. Without the public sector’s leadership, the opportunities to dismiss the need to pay specific attention to the people who are being discriminated against are too tempting and affordable.


Finally, and this is my personal belief, I think some level of data transparency from the organizations collecting it and developing these tools would help identify and prevent this sort of thing from happening in the future. Machines can learn, but human insight needs to be their supervising teacher, and by opening and sharing non-personal data to be analyzed for bias, organizations can benefit from the power of a diverse global community looking to promote fairness.
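
As a sketch of what such an outside review could look like, the snippet below works only from aggregated, non-personal counts: how many people in each group were selected, out of how many. It compares the lowest selection rate to the highest and flags the result when the gap is large. The group names, the numbers, and the 0.8 cut-off are all assumptions made for illustration, not a legal or scientific standard for any particular tool.

```python
def selection_rates(outcomes):
    """outcomes: dict mapping group label -> (num_selected, num_total)."""
    return {group: selected / total for group, (selected, total) in outcomes.items()}


def disparity_ratio(outcomes):
    """Lowest group selection rate divided by the highest (1.0 means parity)."""
    rates = selection_rates(outcomes)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was selected in any group, so there is no gap
    return min(rates.values()) / highest


def flag_for_review(outcomes, min_ratio=0.8):
    """Flag the tool for a closer human look when the gap is large.

    min_ratio is a hypothetical threshold chosen for this example.
    """
    return disparity_ratio(outcomes) < min_ratio


# Hypothetical aggregated counts published by an organization.
published = {"group_a": (50, 100), "group_b": (30, 100)}

ratio = disparity_ratio(published)
needs_review = flag_for_review(published)
print(round(ratio, 2), needs_review)
```

A flag like this doesn’t prove discrimination; it tells the humans doing the supervising where to look first.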

Disclaimer: This isn’t meant to be a scientific analysis of existing algorithms or a technical evaluation of the landscape, so much as a humble translation of what’s been going on for people who aren’t always involved in this space.

Originally posted at: Medium
