Your data is racist, classist, and sexist. Your data is homophobic and gender normative. It’s filled with centuries of institutional inequity and unconscious bias.
The worst part is, you don’t even know it. Thanks to modern analytics practices, which often conceal data models through sophisticated statistical methods, it’s easy to think data bias is a thing of the past. This means you have to work consciously to untangle the messy intersection of statistical bias and our history of inequality. The good news? You can dismantle structural racism one algorithm at a time.
Statistical Bias
Bias is a familiar concept in the world of data, and statistical bias goes by many names. There’s confirmation bias, where you seek out information in an effort to prove a previously formed opinion or hypothesis. There’s selection or sample bias, where your selection methods leave you with non-random data, people, or groups to analyze. Don’t forget the prickly offspring of selection bias: undercoverage, voluntary response bias, and nonresponse bias. And of course, there are confounding variables – factors that correlate with both your independent AND dependent variables. Check out this helpful site for more background and info on these and other forms of statistical bias.
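To make the selection problem concrete, here’s a tiny, purely illustrative simulation (every number in it is invented) of voluntary response bias: if younger people are far more likely to answer a survey, the sample mean drifts away from the true population mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic population: income loosely increases with age (all numbers invented).
age = rng.integers(18, 80, size=100_000)
income = 20_000 + 900 * age + rng.normal(0, 10_000, size=100_000)

# Voluntary-response selection: younger people are far more likely to respond.
response_prob = np.clip(1.2 - age / 80, 0.05, 1.0)
responded = rng.random(100_000) < response_prob

print(f"True mean income:   {income.mean():,.0f}")
print(f"Biased sample mean: {income[responded].mean():,.0f}")  # skews low
```

The fix isn’t mysterious – weighting, better sampling frames, follow-up with non-responders – but you only reach for it if you notice the skew in the first place.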
While statisticians and data scientists can quickly recognize these factors and deploy strategies to address them, the picture becomes murkier when we consider the historical inequity intrinsic to the data itself, and to our methods of collecting it.
Historical Inequality in Today’s Data
The history of racism in America rears its ugly head in today’s data. Mortgage redlining, for instance, still shapes current racial segregation and income disparity. How does this affect your data? A common practice is to append geographic data to existing datasets. This can produce additional data points from which to build more robust predictive models or create more nuanced segments for marketing automation, for example. The problem with geographic data is that it reflects centuries of racial segregation and institutional racism. Determining a constituent’s likely median income and household status from a ZIP code is often a proxy for race and a result of historical inequity. Some companies will produce appended data for your organization, and will go so far as to provide pre-baked “segments” based on a constituent’s location. Examples of these segments include “Multi-Cultural Mosaic,” “Tool-belt Traditionalist,” “Low-rise Living,” “Country Strong,” and “City Roots.”
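If you do work with appended geographic fields, one modest safeguard is to check how strongly they track protected attributes before letting them anywhere near a model. The sketch below assumes a hypothetical file and hypothetical column names ("appended_segment", "zip_median_income", "race", "race_minority_flag"); it’s a quick diagnostic, not a full fairness analysis.

```python
import pandas as pd

# Hypothetical file: your constituent records after a vendor appended geographic fields.
df = pd.read_csv("constituents_with_appended_data.csv")

# How well does the appended segment line up with race? A heavily skewed
# cross-tab is a warning sign that the segment is acting as a proxy.
print(pd.crosstab(df["appended_segment"], df["race"], normalize="index"))

# A quick correlation between the appended income estimate and a 0/1
# protected-attribute flag is another crude proxy check.
print(df["zip_median_income"].corr(df["race_minority_flag"]))
```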
Let’s pretend you’re sending out a mailing about a new program offering. In an effort to be a good steward of your organization’s dollars, you’re trying to predict the likelihood that the mailing will be delivered and opened. Let’s say your model suggests it would be unwise to send mailings to the “Multi-Cultural Mosaic” or “Low-rise Living” segments. Will you accept that recommendation? Are you perpetuating historical inequities by withholding program information in an effort to save money?
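Before accepting a recommendation like that, it’s worth seeing exactly who the model would cut. Here’s a sketch under assumed names – a hypothetical "mailing_scores.csv" with "segment" and "predicted_open_prob" columns, and an example 10% cutoff:

```python
import pandas as pd

scored = pd.read_csv("mailing_scores.csv")  # hypothetical model output, one row per household
scored["recommended_drop"] = scored["predicted_open_prob"] < 0.10  # example cutoff

# What share of each segment would be cut from the mailing?
drop_rates = scored.groupby("segment")["recommended_drop"].mean().sort_values()
print(drop_rates)
# If the drop rate for "Multi-Cultural Mosaic" or "Low-rise Living" towers over
# the rest, the cost savings are being paid almost entirely by those constituents.
```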
The Problem(s) with Algorithms
Machine learning algorithms power most modern digital tools, and decision engines drive much of the content you see online. Much has been said about algorithms letting the data do the talking. Bias is lodged in human brains, the argument goes, so taking humans out of the picture removes the bias, letting the data exist as is.
The problem with this thinking is that algorithms do not construct themselves. Building a decision engine requires a series of decisions made by humans: what data to include when testing and training the model, for example, or what outcome the model is optimizing for. These two decisions alone present countless opportunities to bring bias and inequity into an algorithm. This is where statistical bias and inequality meet in terrifying ways.
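Those two decisions aren’t abstractions; they’re literal lines of code someone writes. A minimal sketch, with entirely hypothetical file and column names, just to show where the human choices live:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Decision one: which records and which columns are allowed to shape the model.
records = pd.read_csv("training_records.csv")              # hypothetical file
feature_columns = ["feature_a", "feature_b", "feature_c"]  # hypothetical inputs

# Decision two: which outcome the model is optimized to predict.
target_column = "outcome_we_chose_to_predict"

model = LogisticRegression(max_iter=1000)
model.fit(records[feature_columns], records[target_column])
# Change either decision and the "objective" model will rank people very differently.
```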
In her book Automating Inequality, author Virginia Eubanks describes a predictive algorithm used by Allegheny County, Pennsylvania’s child and human services department. The model has the laudable intention of predicting child abuse. To construct this model, department staff decided which outcome the model would predict (the dependent, or target, variable) and which independent variables would be used to train it.
The outcome variable they chose was re-referral, the term for contact made by a community member to family services to express concern about a child’s welfare after an initial complaint has already been received. A critical point here is that re-referrals are not substantiated claims of child abuse or neglect. They are, by definition, simply complaints from community members, including anonymous complaints. While many complaints trigger investigations that eventually uncover evidence of child abuse or neglect, many do not. A second critical point is that, historically, African Americans and other minorities are more likely than whites to be named in re-referrals that cannot be substantiated. The result: the Allegheny County model is not specifically predicting abuse or neglect. It’s predicting whether or not someone will receive an additional complaint – a complaint that may or may not be substantiated, a complaint shaded in racism.
A further issue with the Allegheny County model: to train it, department staff used data from publicly funded mental health and drug and alcohol services – who used the services and what outcomes came of their participation. Because wealthier individuals (e.g., individuals residing in areas with higher property values) are more likely to use privately funded services, the reliance on publicly funded data resulted in oversampling lower-income residents. The Allegheny County model thus rests on a biased dataset that, in turn, feeds the prediction of child abuse. The model targets poor residents by oversampling their mental health data relative to that of their wealthier neighbors.
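The sampling effect is easy to see with a toy simulation. Every number below is invented; the point is only that when records come mostly from publicly funded providers, identical underlying behavior shows up very unevenly in the data the county can see.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Invented population: 40% of families are lower income, and the true rate of
# mental health / drug and alcohol service use is the same 15% for everyone.
low_income = rng.random(n) < 0.40
uses_services = rng.random(n) < 0.15

# Assumption for illustration: lower-income families who use services mostly use
# publicly funded providers (90%); wealthier families mostly use private ones (10%).
publicly_funded_share = np.where(low_income, 0.90, 0.10)
in_county_data = uses_services & (rng.random(n) < publicly_funded_share)

print("Low-income share of actual service users:  ",
      round(low_income[uses_services].mean(), 2))   # ~0.40
print("Low-income share of users the county sees: ",
      round(low_income[in_county_data].mean(), 2))  # ~0.86
```

Same underlying behavior, wildly different visibility – and the model only ever sees the visible part.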
So What Do I Do?
Simply put, when working with data, whether you’re analyzing patterns, creating decision engines, or training a machine learning algorithm, ask yourself this question: how am I helping to take structural racism apart? Your data task may have a purpose seemingly unrelated to this question, but you should still be considering how your work can clearly identify and challenge historical inequities that have, for generations, been layered into data. This can be done in several ways:
- Think critically about the approaches and assumptions in the research and analytics you are conducting. Statistical bias is easy to overlook, and it often falls hardest on underprivileged populations.
- Data collection, surveys, and copywriting in digital tools should be inclusive of all populations. Seek (and pay for!) guidance from a diverse range of experts to ensure you are crafting an inclusive digital experience, or you risk inequitably narrowing your program’s population.
- Make sure your analytics tools and research don’t lean on data that reinforces historical inequities.
- Know where your data comes from, and know how your models work – one place to start is a simple audit of how a model’s decisions fall across groups, as sketched below.
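For that last point, here is one modest, concrete starting place: tally how a model’s decisions fall across groups and compare each group to the most-favored one. The file and column names ("model_decisions.csv", "group", "selected") are hypothetical, and the 80% figure is a common rule of thumb, not a statistical or legal guarantee.

```python
import pandas as pd

# Hypothetical audit file: one row per person, with a "group" column
# (e.g., race or appended segment) and a 0/1 "selected" decision column.
audit = pd.read_csv("model_decisions.csv")

# Selection rate per group, and each group's rate relative to the most-selected group.
rates = audit.groupby("group")["selected"].mean()
impact_ratio = rates / rates.max()

print(rates)
print(impact_ratio)
print("Groups below the 80% rule-of-thumb:", list(impact_ratio[impact_ratio < 0.8].index))
```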
This is hard work, to be sure. It requires effort to fundamentally shift how you approach data modeling. But to say the payoff is important is underselling its impact. In a world where data is gathered at increasing speed and algorithms are increasingly part of everyday life, we stand at a familiar point. Will we acknowledge our history of inequity and consciously make choices that steer us away from our past? Or will we close our eyes in the comfort of our privilege and let the next wave of racism, sexism and classism wash over the digital age with the power of a billion ones and zeros?