How to Generate Unbiased Data

Understand, Manage, and Prevent Algorithmic Bias

Abstract

One motto of our times could be "data is the new gold"; however, it will shine only if it is pure and free of dirt. Biased data can be lethally polluted and thus worthless. For example, a tax authority once asked me to help them build an algorithm to direct customs inspectors to those containers in the port that were most likely to contain contraband. The project could not go ahead because the only data they had came from a very limited number of customs inspections their officers had done in the past year. The problem: Customs inspectors had chosen which containers to check, and they had complete freedom in how to conduct the checks. For example, they may have limited themselves to opening the first box falling into their hands and accepting the shipment as containing "Louis Vuitton Croisette handbags" because the duffel bags with a "Luis Vitton" label seemed close enough. Or they may have completely emptied the container and carefully checked the L’s, O’s, and T’s of the Louis Vuitton stamp on two dozen bags, knowing that variations in these letters are among the frequent tell-tale signs of fake bags.

Notes

  1. This is the simple version of the rule. The complicated version would take into account a technique called stratification. For example, if I believe that the persona of the shipper is more important than the individual container, a more efficient sampling strategy might randomly select five containers from each ship regardless of the ship’s size. This means that the probability of a particular container being checked is a lot smaller on a Triple E class ship (which can carry up to 18,000 containers) than on a small vessel carrying just 250 containers. The selection is still random, however, and by applying weights (capturing for each container what fraction of the ship’s total cargo it represents) we can even undo our stratification when estimating the coefficients of the algorithm.
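
The stratified scheme in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names, the dictionary layout of the container records, and the `contraband` field are all hypothetical, and the weight used here (ship size divided by the number of containers drawn, i.e. the inverse of the selection probability) is one common reading of the weights the note describes.

```python
import random


def stratified_sample(ships, per_ship=5, seed=0):
    """Draw up to `per_ship` containers at random from every ship.

    `ships` maps a ship name to a list of container records (hypothetical
    layout). Returns (ship, container, weight) tuples, where `weight` is the
    number of containers on that ship each sampled container stands for.
    """
    rng = random.Random(seed)
    sample = []
    for ship, containers in ships.items():
        k = min(per_ship, len(containers))
        if k == 0:
            continue  # nothing to inspect on an empty ship
        weight = len(containers) / k  # inverse of the selection probability
        for container in rng.sample(containers, k):
            sample.append((ship, container, weight))
    return sample


def weighted_rate(sample, has_contraband):
    """Estimate the port-wide contraband rate, undoing the stratification."""
    total_weight = sum(w for _, _, w in sample)
    hits = sum(w for _, c, w in sample if has_contraband(c))
    return hits / total_weight


# Hypothetical port: one very large clean ship, one small fully fake one.
ships = {
    "triple_e": [{"id": i, "contraband": False} for i in range(18000)],
    "small_vessel": [{"id": i, "contraband": True} for i in range(250)],
}
sample = stratified_sample(ships)
rate = weighted_rate(sample, lambda c: c["contraband"])
```

In this made-up port, half of the ten sampled containers hold contraband, so an unweighted average would suggest a 50% rate. The weighted estimate instead recovers 250 out of 18,250 containers, roughly 1.4%, because each container drawn from the large ship counts for 3,600 containers and each one from the small vessel for only 50.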

Copyright information

© 2019 Tobias Baer

Cite this chapter

Baer, T. (2019). How to Generate Unbiased Data. In: Understand, Manage, and Prevent Algorithmic Bias. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-4885-0_17
