How to Detect Algorithmic Biases

Abstract

In the previous chapter, I indicated that monitoring plays a central role in managing algorithms. This is surprisingly tricky. As Ron DeLegge II put it nicely, "99 percent of all statistics only tell 49 percent of the story." As a result, a lot of rubbish is said and done because of meaningless numbers showing up in some report. Even if no bad intentions are involved, a poorly calculated or interpreted number can seriously mislead you. This chapter is a comprehensive review of how best to monitor algorithms for biases from a user’s perspective.

Notes

  1. Ron DeLegge II, Gents with No Cents, 2nd edition, Half Full Publishing Group, 2011.

  2. As this is neither a book about statistical hypothesis testing nor a section of the book squarely aimed at statistics pros, I purposely don’t discuss details such as whether to use a one- or two-sided t-test or if a z-test would be better. I refer users to the trusted hands of their data scientist for choosing the best variation of these tests in light of the specific circumstances. The good news is that if you want a directional marker of a problem, all of these tests will wave at you if something is really fishy—just as when you smell rotten eggs, it doesn’t quite matter if you sniff through your left or right nostril and if you hold the egg 1 inch or 10 inches away!

  3. Or look up J. Cohen, “A Power Primer,” Psychological Bulletin, 112(1), 155–159, 1992, for authoritative guidance on sample sizes.

  4. This is because the neurotic aspect of the variable—ringing an alarm bell when the customer actually was perfectly good—had diminished its predictive power.

  5. The Basel Committee on Banking Supervision published a very comprehensive review of both common and more arcane metrics for assessing the rank ordering of algorithms in its Working Paper No. 14 (May 2005) titled “Studies on the Validation of Internal Rating Systems.” It singles out Gini and K-S as most useful as well.

  6. Have you noticed that the average of the decile with the largest predictions is larger than the single largest prediction made by the algorithm? Well done! This happens in particular with outliers—it looks like our gorilla man was correctly included in the top decile but on an absolute level, the algorithm still has vastly underestimated the number of hairs, treating the furry beast as a human. This is a common phenomenon.

  7. It is extremely unlikely that you will observe exactly the same ratio (i.e., 2.7 : 1) in each decile. This is due to two reasons. First, rates are bounded between 0% and 100%, which means that any rate of 50% or higher can never double. (What actually doubles is the odds, which are an unbounded transformation of rates.) Second, empirically the kind of macroeconomic forces I am alluding to in this example have only a limited effect on very safe borrowers, who therefore show less fluctuation across the business cycle than riskier borrowers. As a result, if the population rate rises by a certain factor, very safe and very risky customers tend to see a smaller relative increase in default rates than customers with intermediate risk levels.

  8. Chapter 3 showed the structural form of a linear regression to estimate continuous variables. For binary outcomes, the equivalent structural form is a logistic regression, which also has a constant term. If the algorithm has a more complicated structure, it may not have an explicit constant term but if you convert its output into a logit score and then wrap a logistic function around it, you can make a virtual adjustment of the “implicit” constant term.

  9. Assume that in your reference data set 50% of the population are female but that only 20% of applicants approved by your CV screening algorithm are women. As we discussed, whether or not 20% versus 50% is a significant difference also depends on the absolute number of applicants approved by the algorithm in the period under consideration. This pragmatic alternative to a t-test does not consider the actual number of cases but simply assumes something—for example, that each quarter you are screening at least 100 applicants and therefore 20% would be a significant problem.

  10. PSI actually is proportional to chi-square, as has been shown by Bilal Yurdakul in his dissertation, “Statistical Properties of Population Stability Index” (2018).

  11. The median is similar to the average except that it is the actual value of the case for which 50% of the sample are lower and 50% are higher (hence the case is “in the middle”). Assume you are in a country where 33 million people earn $1,000 per month each, 33 million people earn $2,000, 33 million people earn $3,000, and 100 tycoons earn $1 billion per month each. The average income in this country is $3,010 but the median is only $2,000—the median therefore is a much better proxy for the “typical” income and what the average income of “normal” people is.

  12. While for continuous variables (e.g., income) it is easy to calculate the average or median, this is impossible for categorical variables (e.g., job—you obviously never want to say that a tax collector is the average of an accountant and a robber!). Also, the “mode” (i.e., the most frequent category) is often a poor proxy for the average because the most frequent category tends to sit at the lower end (e.g., the most common jobs often are relatively poorly paid). A better approach is to sort categories by their median outcome (e.g., median income of each job category), then identify the population median (here, the median income), and look up which category (i.e., here job) is closest to it (e.g., computer programmers).

  13. I am sure you know that the favorite food of Martians is potatoes—so Joe’s Potato naturally has most of its branches in areas with a mostly Martian population, and the five Martian ghettos you identified account for a whopping 60% of all branches of Joe’s Potato. This distance variable therefore created a backdoor to redlining for the algorithm.
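
Note 9 observes that whether 20% versus 50% is a significant difference depends on how many applicants were approved. The sketch below illustrates that point with an exact one-sided binomial test. This test is an assumption chosen for illustration (note 2 deliberately leaves the choice of test to a data scientist), and the sample sizes of 10 and 100 as well as the function name are hypothetical.

```python
from math import comb


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), i.e. an exact one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


REFERENCE_SHARE = 0.5  # share of women in the reference data set (from note 9)
OBSERVED_SHARE = 0.2   # share of women among approved applicants (from note 9)

# Hypothetical numbers of approved applicants in the monitoring period.
for n_approved in (10, 100):
    k_women = round(OBSERVED_SHARE * n_approved)
    p_value = binomial_lower_tail(k_women, n_approved, REFERENCE_SHARE)
    print(f"n={n_approved:>3}: {k_women} women approved, p-value={p_value:.2e}")
```

With only 10 approvals the same 20% share is far weaker evidence than with 100 approvals, which is why a fixed threshold on the share implicitly assumes a minimum number of cases.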
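
Note 10 refers to the Population Stability Index (PSI). As a minimal sketch, the commonly used definition sums (actual share - expected share) * ln(actual share / expected share) over all bins. The bin shares and the function name below are hypothetical, and the way empty bins are handled (rejecting them) is an assumption rather than something the chapter preview specifies.

```python
from math import log


def population_stability_index(expected, actual):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected).

    Both arguments are sequences of bin shares that each sum to 1.
    """
    if len(expected) != len(actual):
        raise ValueError("both distributions need the same number of bins")
    if any(share <= 0 for share in (*expected, *actual)):
        # Assumption: empty bins are rejected; merging bins is one workaround.
        raise ValueError("every bin needs a positive share")
    return sum((a - e) * log(a / e) for e, a in zip(expected, actual))


# Hypothetical bin shares for a reference period and the current period.
reference = [0.25, 0.25, 0.25, 0.25]
current = [0.10, 0.20, 0.30, 0.40]
print(round(population_stability_index(reference, current), 4))
```

Every term of the sum is non-negative, so the index is zero only when the two distributions match bin by bin and grows as they drift apart.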
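
The income example in note 11 can be checked with a few lines of arithmetic. The sketch below works on (income, head count) pairs instead of one entry per person; the helper names are hypothetical, and the median helper returns the lower of the two middle values when the population splits exactly in half.

```python
def weighted_mean(groups):
    """Average value across all people, given (value, head_count) pairs."""
    total_people = sum(count for _, count in groups)
    return sum(value * count for value, count in groups) / total_people


def weighted_median(groups):
    """Value of the person in the middle when everyone is sorted by value."""
    total_people = sum(count for _, count in groups)
    midpoint = total_people / 2
    running = 0
    for value, count in sorted(groups):
        running += count
        if running >= midpoint:
            return value
    raise ValueError("groups must not be empty")


# Monthly incomes and head counts taken from note 11.
incomes = [
    (1_000, 33_000_000),
    (2_000, 33_000_000),
    (3_000, 33_000_000),
    (1_000_000_000, 100),
]
print(f"average: {weighted_mean(incomes):,.0f}")
print(f"median:  {weighted_median(incomes):,}")
```

The 100 tycoons move the average by roughly a thousand dollars while leaving the median untouched, which is the point the note makes about the median as a proxy for the typical case.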

Copyright information

© 2019 Tobias Baer

Cite this chapter

Baer, T. (2019). How to Detect Algorithmic Biases. In: Understand, Manage, and Prevent Algorithmic Bias. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-4885-0_15
