Can a Machine Learn Inclusivity? That Depends on the Teacher

Niki Kilbertus and others in machine learning believe that overcoming bias requires a research culture reboot. We should hear them out.

Digital Impact 4Q4: Niki Kilbertus on Closing the Fairness Gap in Machine Learning


00:00 CHRIS DELATORRE: This is Digital Impact 4Q4. I’m Chris Delatorre. Today’s four questions are for Niki Kilbertus, doctoral student and researcher at Max Planck Institute for Intelligent Systems. With funding from Digital Impact, Kilbertus is developing an online library to help counteract bias. With racial unrest shedding new light on AI’s fairness problem, his open source tool, called Fairensics, is aiming for a more holistic fix.

00:36 CHRIS DELATORRE: Ensuring fairness in artificial intelligence. This seems like a lofty expectation. But despite the challenges, ending discrimination in digital services is something you’ve set out to do. Where does the concept of fairness get lost in the stages of app development and what would a failsafe look like?

00:57 NIKI KILBERTUS: Right. That’s a tall order, I agree. I think that the most important point here is that we’re not just trying to fight this specific technical detail of one algorithm or multiple algorithms in isolation. I mean, these issues of fairness and injustice and discrimination are much larger than just a technical algorithm or a technical solution. These are things that are deeply ingrained and rooted in our society and sort of our collective thinking almost. I think therefore it’s sort of delusional to think that the challenge of achieving fairness or avoiding discrimination in machine learning and in AI is any smaller than the general discussion we’re having about these issues.

“When it comes to these narrow statistical definitions, there will never be a satisfying standalone check for fairness.”

Algorithms can certainly exacerbate or amplify and even create harms by themselves. And if the question is where does it happen in the stages, that could happen anywhere. And that is precisely the issue. We cannot just put our finger on, oh it’s the data and only if we get the data right everything will be fine. Or it’s the optimization technique that we’re choosing. It could happen in all the design choices, from the very beginning to unexpected or unforeseen consequences in the long future as sort of an aftermath. And there’s no real failsafe solution I think that we can hope for. But what is crucial is to bring as many perspectives and as many viewpoints to the table while developing these things. So we can already start thinking about each of the stages. What could go wrong? How can we anticipate what could be going wrong?

And what really matters here is to bring the people to the table that are affected by the algorithms. It’s not enough to just know that people are suffering hardship in society that may be exacerbated by algorithms, and sort of factor them into our decisions and sort of me trying to think about the people that could be harmed. We really need these people at the table when it comes to making design decisions and when it comes to developing these algorithms, and also when it comes to the decisions of whether they should be deployed or not. So I think that’s the broader issue that we need to tackle, also as a research community.

03:14 CHRIS DELATORRE: You shifted your research in 2016 when ProPublica exposed racial bias in the court system—specifically a risk-assessment tool that exhibited bias against Black defendants. The software developer disagreed, but according to you, they were both right. As it turns out, a difference in approaches had resulted in mismatched conclusions. How often is this the case? In the past, you’ve said it’s a problem of scholars relying on different models and often being unaware of each other’s work. If good intentions aren’t enough, then how can we work toward a credible standard for assessing ethics going forward?

03:53 NIKI KILBERTUS: First things first. I think in this specific example, we’re talking about two very concrete and narrowly defined statistical notions of discrimination. So, when I said both parties were right in the past, I’m really referring to these definitions that can be checked sort of in a binary fashion, either they’re satisfied or they’re not. And while they’re way too narrow to do the term “discrimination” more broadly justice, they can nevertheless be useful criteria to figure out that something may be amiss.

“The people we want to protect from algorithmic discrimination are the people who are underrepresented at the top.”

So, typically what happens in these statistical criteria is that one splits all the data points that one has observed into two groups according to some sensitive attribute—for example into racial or gender groups, so it could also be multiple, not just two. And then within each of these groups we compute various statistics. For example, we can look at what fraction of people in each group got positive decisions, or how many errors were made in each of the groups. And it seems quite natural that we wouldn’t want an automated decision-making system to make errors at different rates in the different groups. The issue here is that errors can be measured in different ways. And that’s precisely what happened here.
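The procedure described above (split by a sensitive attribute, then compute statistics within each group) can be sketched in a few lines of Python. This is an illustrative sketch, not code from Fairensics: the function names `group_rates` and `expand`, the group labels, and all counts are hypothetical, made-up values chosen only to show how the statistics are computed.

```python
from collections import defaultdict


def group_rates(records):
    """Per-group statistics from (group, actual, predicted) triples with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, actual, predicted in records:
        correct = "t" if actual == predicted else "f"   # was the decision right?
        sign = "p" if predicted == 1 else "n"           # was the decision positive?
        counts[group][correct + sign] += 1

    def ratio(num, den):
        # Return None instead of dividing by zero for empty denominators.
        return num / den if den else None

    rates = {}
    for group, c in counts.items():
        total = sum(c.values())
        rates[group] = {
            # Fraction of people in the group who got a positive decision.
            "positive_rate": ratio(c["tp"] + c["fp"], total),
            # Errors among people whose true label is negative.
            "false_positive_rate": ratio(c["fp"], c["fp"] + c["tn"]),
            # Errors among people whose true label is positive.
            "false_negative_rate": ratio(c["fn"], c["fn"] + c["tp"]),
            # Fraction of positive decisions that were correct.
            "precision": ratio(c["tp"], c["tp"] + c["fp"]),
        }
    return rates


def expand(group, tp, fp, fn, tn):
    """Build toy records for one group from hypothetical confusion-matrix counts."""
    return ([(group, 1, 1)] * tp + [(group, 0, 1)] * fp
            + [(group, 1, 0)] * fn + [(group, 0, 0)] * tn)


# Hypothetical data: two groups of 15 people each.
records = expand("A", tp=6, fp=3, fn=2, tn=4) + expand("B", tp=2, fp=1, fn=2, tn=10)
rates = group_rates(records)
for group in sorted(rates):
    print(group, rates[group])
```

In this toy data, two out of every three positive decisions are correct in both groups, yet the false positive rate is 3/7 in group A and 1/11 in group B. Which of these numbers counts as "the" error rate is exactly the kind of choice the conversation turns to next.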

So, one group, ProPublica, decided to measure error rates in one way, and the algorithm designer decided to measure error rates in another way. And both of them seemed like reasonable fairness criteria. You don’t want errors to be made in any way, disparately between the two groups. And later, I think in 2016, researchers have proven mathematically that you cannot have both at the same time—both of these criteria for matching the error rates. And that is unless it’s a very simple or a very unlikely scenario.

And this is precisely what happened here. So, ProPublica pointed out that one criterion isn’t satisfied, and it seems as if this criterion intuitively is very important to satisfy fairness. And the algorithm designer countered that they looked at another criterion that they did satisfy. So now we know that they could not have satisfied both, that they sort of had to make a choice. And that already shows that when it comes to these narrow statistical definitions, there will never be a satisfying standalone check for fairness. They can be good pointers, you know, as basic checks that something is amiss here, that something seems odd. But they can never be a certification of, you know, this thing satisfies fairness. I think realizing that there is not this one-size-fits-all kind of solution is really important in trying to design these systems.
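A small numeric sketch may help show why both criteria cannot hold at once. The transcript does not name the two criteria, so the following assumes they are equal precision across groups (the share of positive decisions that are correct) and equal false positive and false negative rates. For any binary classifier, the confusion-matrix definitions imply FPR = p / (1 - p) × (1 - PPV) / PPV × (1 - FNR), where p is the group's base rate and PPV is precision. The function name and all numbers below are hypothetical.

```python
def implied_false_positive_rate(base_rate, precision, false_negative_rate):
    """False positive rate forced by the other three quantities.

    Follows from the definitions: FP = TP * (1 - PPV) / PPV,
    TP = (1 - FNR) * positives, and positives / negatives = p / (1 - p).
    """
    odds = base_rate / (1 - base_rate)
    return odds * (1 - precision) / precision * (1 - false_negative_rate)


# Hypothetical setting: both groups share precision and false negative rate,
# but the outcome is more common in group A than in group B.
precision = 0.6
false_negative_rate = 0.3
fpr_a = implied_false_positive_rate(0.5, precision, false_negative_rate)
fpr_b = implied_false_positive_rate(0.3, precision, false_negative_rate)

print(f"group A false positive rate: {fpr_a:.3f}")
print(f"group B false positive rate: {fpr_b:.3f}")
```

Under these assumptions the false positive rates come out different whenever the base rates differ, so matching precision and matching both error rates is only possible in special cases such as equal base rates or a perfect predictor.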

“We need to give the very people who suffer from the hardship more agency and more power to make decisions.”

06:40 CHRIS DELATORRE: When you began developing Fairensics a few years ago, you say there really was no way to see what others in the machine learning community were doing. But this changed when companies like Amazon, Google, and others were found to have AI systems that perpetuated certain types of discrimination. What happened next changed the AI fairness landscape. You describe the research community and industry as starting to “pull in the same direction in terms of trying to solve and mitigate the problem,” IBM and Microsoft included. This makes me think of the California Gold Rush when hundreds of thousands migrated west to find their fortune. Now, let’s assume that all of this new activity is primarily driven by the concept of fairness. How can the research community help to guarantee transparency where everyone plays by the same set of rules, and what are the implications otherwise? Are we looking at another Wild West?

07:40 NIKI KILBERTUS: I think that really is the one million dollar question here. Again, bringing about sort of larger scale structural societal changes that we certainly desperately need at this point in time is a much bigger endeavor than just the algorithm development. But I think you’re right that there is a danger of viewing this sort of from a purely technocratic perspective and saying that, you know, we’ll fix the problematic algorithms by developing even more algorithms.

And now with all these parties (industry, governments, researchers) coming to the table, there is a danger in sort of making sure that people, as I may have said, pull in the same direction: who sets this direction and how can we make sure we set the direction properly? I think that there won’t be a good top-down approach to sort of set this direction properly. And the reason I don’t think this is going to work is that the people we want to protect from algorithmic discrimination are precisely the people who are heavily underrepresented at the top. So, even if industry and governments and researchers get together and even if they could somehow magically decide on a single direction on what they should do to build fair systems, that decision would largely exclude all the voices of the very people they are trying to support and protect.

So, put very bluntly, only the people at harm can actually tell us how to improve their situation. We shouldn’t think that we can do this and that there’s sort of one right way and it’s a logical path and we can figure it out just by thinking about it. We really need to talk to these people and we really need to get them to the decision table and have them included in the decision processes and their viewpoints represented.

This is something that needs to start very early on in our systems, from kindergarten to university, from management to faculty positions. We need to actively seek to work against our inherent biases and the status quo and to give the very people who suffer from the hardship more agency and more power to make decisions. So, we shouldn’t just ponder sort of how we can achieve fairness in a purely logical fashion and in specific algorithms but we really need to make sure that there’s a larger diversity and viewpoints represented when we make these decisions.

“We need to avoid the technocratic message of ‘we will fix algorithms and thereby inequity will go away.’”

10:05 CHRIS DELATORRE: You see a disconnect between the technical and ethical aspects of developing AI. You’ve suggested that scholars are entrusting or consigning matters of ethics to policymakers and private companies. Something you said about the ProPublica piece resonated: you asked why we need pretrial risk assessments in the first place. Given that the research community should be focusing more on the why, what responsibility do researchers have to address inequality in machine learning, and where does that responsibility end?

10:40 NIKI KILBERTUS: As researchers we have two types of responsibility. First, we need to work on the inequity that we have within our own community. I think that machine learning here is on a path where underrepresented groups are really having their voices heard, not by everyone in the community, but I think that there’s a general change going on and I hope that this will allow us to really increase diversity within our own ranks.

I mentioned this before, I think it’s really important to have these viewpoints diversely represented in the research community, so we can pick out the right problems and right questions to even ask in the first place before trying to jump ahead and just develop solutions to problems that may not even be the actual issues. This is not just the responsibility to society at large but also towards our own culture and colleagues. So this is even just among friends sort of being respectful to everyone.

And with that I hope that we can take a fresher look at what we should be doing as a machine learning research community, what are the right questions in terms of injustice and unfairness, and where are maybe the points where we can actually make a change.

So, there are really great role models in the community that try to hammer home this point and fight tirelessly against all the backlash they’re getting. At the risk of maybe forgetting many and mispronouncing some of the names, there are people like Rediet Abebe, Timnit Gebru, Joy Buolamwini, Deb Raji, Shakir Mohamed, William Isaac, and many more. So people should really follow them and support their voices and listen to them and see what we really should be doing as a community.


“There is no consensus on how discrimination in machine learning algorithms should be assessed or prevented.” Learn more about the origin and aims of the Fairensics project.

And on a broader scope, or as a second responsibility, I think we need to make sure that we don’t oversell what we can do within machine learning algorithms. So, avoid the technocratic message of ‘we will fix algorithms and thereby inequity will go away.’ So, I think we have a responsibility to communicate these things in a more nuanced fashion to the broader public and really highlight that the key issues, the underlying structural issues, will not go away just because we developed a new optimization algorithm.

I think especially for researchers, there rarely comes this sort of handing off component that almost seems to be more a part of being a software engineer at a company, who develops an algorithm to specification and then hands it off and no longer feels responsible. And so if we include that, maybe it’s not a research role but a software development role, then I do not think that responsibility ends at any point.

I think we all have a responsibility, especially when we’re working in research, to constantly keep questioning our approaches, constantly try to think outside the box, and constantly ask ourselves what may we have missed, what could there be that we have missed. And as soon as we find one of these things, definitely speak up, mention it, and also be willing to accept and state publicly that we have made mistakes when we realize them in hindsight. So, I don’t like to think about sort of this end of day handing it off cleaning your hands kind of situation. I think we have to constantly be skeptical of the approaches we are developing.

14:14 CHRIS DELATORRE: Niki Kilbertus, doctoral student and researcher at Max Planck Institute for Intelligent Systems, thank you.

Digital Impact is a program of the Digital Civil Society Lab at the Stanford Center on Philanthropy and Civil Society. Follow this and other episodes at and on Twitter @dgtlimpact with #4Q4Data.