Machine Learning Solutions to Address Disparity and Create Fairness
By Dr. Emily Barrow DeJeu and Dr. Emily Diana
Dr. Emily R. Diana, along with a team of researchers, proposes two approaches to creating fairness when there is disparity between groups. One uses simple algorithms to ensure that the groups who are typically worst off have the best possible outcomes, and the other uses a proxy learning model to predict possible disparity in situations when data about protected attributes is not available.
The Problem: Organizations want to achieve fairness, but data about protected attributes can be challenging to work with
Companies are mindful of the ways that protected attributes like race, ethnicity, and sex can create disparities between groups. Today’s business leaders generally understand that ignoring these differences perpetuates inequality, and many want to tackle disparity directly and pursue fairness.
But it’s not always clear how to take an effective data-driven approach to equity. For example, industry leaders might assume that equalizing error rates – a common framework for pursuing group fairness – is a good approach. However, Diana’s research suggests that in practice, equalizing errors can further disadvantage some groups. This is a serious problem in situations where any increase in disparity is a bad outcome, such as in predicting the need for medical care or social welfare interventions like restraining orders to prevent domestic violence.
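To see the pitfall concretely, consider a toy illustration in Python. The error rates below are invented for illustration and do not come from Diana's studies; the point is simply that the cheapest way to make two groups' error rates equal is often to let the better-served group get worse, which helps no one.

```python
# Invented per-group error rates, purely for illustration.
group_errors = {"A": 0.05, "B": 0.15}

# Equalizing error rates: the easiest way to make the rates identical is often
# to degrade the better-served group until it matches the worse-served one.
equalized = {g: max(group_errors.values()) for g in group_errors}
print(equalized)  # {'A': 0.15, 'B': 0.15} -- group A is now worse off; B gains nothing

# A minimax objective instead asks: make the *worst* group's error as small as
# possible, while leaving group A's low error alone.
print("minimax target: minimize", max(group_errors.values()))
```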
Equally challenging is the fact that in many cases, data about protected attributes is not available, sometimes for legal reasons (for instance, in the U.S., it is illegal to use race as an input in consumer lending models), and other times for policy reasons (organizations might not want to ask their clients for sensitive data). Companies that want to pursue fairness while respecting legal and policy constraints might find it difficult to take data-driven approaches to equity. As Diana puts it, how can a company be fair with respect to race when it doesn't have access to data about race?
Solution #1: Two simple algorithms use a minimax approach to ensure worse-off groups have the best possible outcomes
Diana and her team tackled the first problem – pursuing fairness between groups – by developing two algorithms that use a minimax approach. Unlike approaches that try to equalize disparity among all groups, the minimax approach works to ensure groups that are usually worst off have the best possible outcomes. The first algorithm finds a minimax group fairness model, and the second navigates tradeoffs between relaxing certain aspects of the minimax model and maintaining overall accuracy. Together, these algorithms allow users to set goals for fairness and then adjust tradeoffs between groups to accomplish those goals.
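For readers who want a concrete picture, here is a minimal Python sketch of the general reweighting strategy behind minimax group fairness: repeatedly upweight whichever group currently has the worst error, so the learner is driven to shrink the maximum group error. This is an illustrative sketch using scikit-learn, not Diana's published algorithm; the function name minimax_train, its parameters, and the simple multiplicative-weights update are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def minimax_train(X, y, groups, rounds=50, step=1.0):
    """Sketch: repeatedly upweight whichever group currently has the worst
    error, pushing the learner to minimize the maximum group error."""
    group_ids = np.unique(groups)
    weights = {g: 1.0 for g in group_ids}          # per-group sample weights
    models, max_errors = [], []

    for _ in range(rounds):
        sample_w = np.array([weights[g] for g in groups])
        model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_w)
        models.append(model)

        # Measure each group's error and raise the weight of groups doing badly
        # (a multiplicative-weights style update).
        errs = {g: np.mean(model.predict(X[groups == g]) != y[groups == g])
                for g in group_ids}
        max_errors.append(max(errs.values()))
        for g in group_ids:
            weights[g] *= np.exp(step * errs[g])
        total = sum(weights.values())
        weights = {g: w / total for g, w in weights.items()}

    # In practice one would use a randomized or averaged model over the rounds.
    return models, max_errors

# Usage (with X, y, groups as NumPy arrays):
# models, max_errs = minimax_train(X, y, groups)
# best = models[int(np.argmin(max_errs))]   # model with the lowest worst-group error
```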
Naturally, people initially imagine applying these algorithms to scenarios involving fairness between groups of people, and Diana’s research is well suited to those kinds of applications. For example, she and her team analyzed data used to predict how likely a criminal defendant is to reoffend – data that courts can use when deciding whether or not to grant bail. It turns out that an equal-error approach, which requires all groups to have the same error rate, creates high error rates (and thus inequity) for almost every group and does little to advance fairer bail decision making. But Diana’s minimax algorithms let users cap the worst group’s error rate, which improves fairness for some groups at minimal cost to others – outcomes that support fairer bail decisions.
But Diana’s algorithms don’t just apply to issues of fairness between groups of people; companies can use them in any scenario where they need to impose statistical constraints on certain subsets of data. For instance, Diana and her team used these algorithms to predict public bike demand in Seoul based on seasonal trends. Once again, an equal-errors approach increased errors and created more disparity, while Diana’s minimax algorithms let users balance improved accuracy for summer predictions against slightly worse error rates for the other three seasons – a tradeoff that seems sensible and worth pursuing.
Solution #2: A “fairness pipeline” uses protected class data to train fairer, more accurate prediction models
Diana’s minimax algorithms are effective when data about groups is available, but what if it’s not? When fairness is a concern in situations like this, using proxies for race or ethnicity is standard practice. But it is not clear what proxies are best, nor are there straightforward solutions for how to create these proxies algorithmically.
Diana and her team tackled this problem by creating what they call a “fairness pipeline” in which an “upstream” learner that does have access to protected data learns a proxy model for sensitive features like race, ethnicity, and sex. That proxy model can then be used “downstream,” when protected data is unavailable, to train a machine learning model to be fair.
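A rough Python sketch of the two-stage idea may help, with the caveat that the function names and the simple group-balancing step downstream are placeholders rather than Diana's published pipeline: upstream, where the sensitive attribute is available, a proxy model learns to predict it from ordinary features; downstream, where it is not, the proxy's predictions stand in as group labels for group-aware training (for example, a minimax learner like the sketch above).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def train_proxy_upstream(X_up, protected_up):
    """Upstream learner: the sensitive attribute IS available here, so learn a
    proxy that predicts it from the ordinary, non-sensitive features."""
    proxy = RandomForestClassifier(n_estimators=100)
    proxy.fit(X_up, protected_up)
    return proxy

def train_fair_downstream(proxy, X_down, y_down):
    """Downstream learner: the sensitive attribute is NOT available, so the
    proxy's predictions stand in as group labels for group-aware training."""
    inferred_groups = proxy.predict(X_down)

    # Placeholder fairness step: reweight samples so every inferred group
    # contributes equally. Any group-fair learner (e.g., a minimax learner)
    # could be plugged in here instead.
    group_sizes = {g: np.sum(inferred_groups == g) for g in np.unique(inferred_groups)}
    sample_w = np.array([len(y_down) / (len(group_sizes) * group_sizes[g])
                         for g in inferred_groups])
    model = LogisticRegression(max_iter=1000).fit(X_down, y_down, sample_weight=sample_w)
    return model, inferred_groups
```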
Diana tested her team’s fairness pipeline on a dataset used to predict income, with race as the protected attribute. She found that their “true labels” model, which used real protected data to train a proxy, generated far more accurate predictions than a “baseline proxy” approach, which attempts to predict sensitive attributes without any actual protected data. This is notable, says Diana, because a baseline proxy approach often seems like the most natural way to handle an absence of protected-class data, but as her research indicates, these models can’t ensure fairness because they aren’t designed to produce it.

Why They Matter: Diana’s tools are straightforward solutions for organizations looking to control for disparate performance between groups
Companies looking for effective machine learning methods to ensure fairness will be encouraged to learn that Diana’s algorithms are feasible solutions. They’re short, fast algorithms that Diana suggests do not take any more time to train than the models an organization would normally use. They’re also flexible: they can be adapted for a variety of classification methods and error rates, and her fairness pipeline, once trained, can be used on future datasets from a shared distribution. Finally, these tools are highly customizable: users can input their own training data, specify different considerations, and get a model that they can iteratively adjust to suit different performance goals.
For more on Diana’s research into machine learning solutions for reducing disparity and promoting fairness, visit her website at www.emilyruthdiana.com.