
Probabilistic Blocking with an Application to the Syrian Conflict

  • Conference paper

Privacy in Statistical Databases (PSD 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11126)


Abstract

Entity resolution seeks to merge databases so as to remove duplicate entries, where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which originates in the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce to the entity resolution literature a subquadratic variant of LSH known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, with a discussion of each method.
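As a rough illustration of the minhash-based blocking idea underlying these methods, records can be shingled and minhashed so that the fraction of agreeing signature coordinates estimates the Jaccard similarity of the shingle sets. This is a minimal sketch, not the paper's implementation; the function names, the salted built-in hash, and the example strings are ours.

```python
import random

def shingles(text, k=3):
    """Character k-shingles (k-grams) of a record string."""
    s = text.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=0):
    """One minimum per hash function; a salted built-in hash stands in
    for an independent hash family, purely for illustration."""
    rng = random.Random(seed)
    salts = [str(rng.random()) for _ in range(num_hashes)]
    return [min(hash(salt + sh) for sh in shingle_set) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions: an unbiased estimate
    of the Jaccard similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Near-duplicate names yield a high estimated similarity; unrelated names do not.
sig_a = minhash_signature(shingles("mohammed al-halabi"))
sig_b = minhash_signature(shingles("mohamed al-halabi"))
sig_c = minhash_signature(shingles("completely different person"))
```

Banding such signatures (hashing groups of coordinates into buckets) then yields the candidate blocks, which is the step KLSH and DOPH each accelerate in their own way.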


Notes

  1. In this paper, we utilize a shingling-based approach, and thus our representation of each record is likely to be very sparse. Moreover, [23] showed that minhashing-based approaches are superior to random-projection-based approaches for very sparse data sets.

  2. The assumption holds when dealing with floating point numbers for small enough \(\delta \).

  3. Note that the precision of a blocking procedure is not expected to be high, since we are only placing similar pairs in the same block (not fully running an entity resolution or de-duplication procedure, which would try to maximize both recall and precision).
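The pairwise recall and precision referred to in this note can be computed from a blocking assignment and a set of ground-truth matches. The following sketch uses hypothetical toy data and function names of our own choosing.

```python
from itertools import combinations

def blocking_recall_precision(blocks, true_matches):
    """blocks: dict mapping a block key to a list of record ids.
    true_matches: set of frozensets, each a true duplicate pair.
    Recall = fraction of true pairs placed in a common block;
    precision = fraction of co-blocked pairs that are true matches."""
    candidates = set()
    for ids in blocks.values():
        candidates.update(frozenset(pair) for pair in combinations(ids, 2))
    found = candidates & true_matches
    recall = len(found) / len(true_matches) if true_matches else 1.0
    precision = len(found) / len(candidates) if candidates else 1.0
    return recall, precision

# Toy data: the true pair {3, 4} straddles two blocks, lowering recall,
# while the non-matching co-blocked pairs {1, 3} and {2, 3} lower precision.
blocks = {"b1": [1, 2, 3], "b2": [4, 5]}
truth = {frozenset({1, 2}), frozenset({4, 5}), frozenset({3, 4})}
recall, precision = blocking_recall_precision(blocks, truth)
```

For blocking, recall is the quantity to protect: a pair split across blocks can never be recovered downstream, whereas low precision only costs extra comparisons.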

References

  1. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of Sequences 1997, pp. 21–29. IEEE (1997)

  2. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Eidenbenz, S., Triguero, F., Morales, R., Conejo, R., Hennessy, M. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 693–703. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45465-9_59

  3. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 1537–1555 (2012)

  4. Christen, P., Gayler, R., Hawking, D.: Similarity-aware indexing for real-time entity resolution. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1565–1568 (2009)

  5. Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. In: Very Large Data Bases (VLDB), vol. 99, pp. 518–529 (1999)

  6. Gollapudi, S., Panigrahy, R.: Exploiting asymmetry in hierarchical topic extraction. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 475–482. ACM (2006)

  7. Haeupler, B., Manasse, M., Talwar, K.: Consistent weighted sampling made fast, small, and easy. Technical report (2014). arXiv:1410.4266

  8. Herzog, T., Scheuren, F., Winkler, W.: Data Quality and Record Linkage Techniques. Springer, New York (2007). https://doi.org/10.1007/0-387-69505-2

  9. Herzog, T., Scheuren, F., Winkler, W.: Record linkage. Wiley Interdisc. Rev.: Comput. Stat. 2 (2010). https://doi.org/10.1002/wics.108

  10. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, Dallas, TX, pp. 604–613 (1998)

  11. Ioffe, S.: Improved consistent sampling, weighted minhash and L1 sketching. In: ICDM, Sydney, AU, pp. 246–255 (2010)

  12. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM (2000)

  13. Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recogn. Lett. 31, 1348–1358 (2010)

  14. Price, M., Ball, P.: The limits of observation for understanding mass violence. Can. J. Law Soc./Revue Canadienne Droit et Société 30, 237–257 (2015a)

  15. Price, M., Ball, P.: Selection bias and the statistical patterns of mortality in conflict. Stat. J. IAOS 31, 263–272 (2015b)

  16. Price, M., Gohdes, A., Ball, P.: Documents of war: understanding the Syrian conflict. Significance 12, 14–19 (2015)

  17. Price, M., Klingner, J., Qtiesh, A., Ball, P.: Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the UN High Commissioner for Human Rights (2013)

  18. Price, M., Klingner, J., Qtiesh, A., Ball, P.: Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the UN High Commissioner for Human Rights (2014)

  19. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)

  20. Sadosky, P., Shrivastava, A., Price, M., Steorts, R.C.: Blocking methods applied to casualty records from the Syrian conflict (2015). arXiv:1510.07714

  21. Shrivastava, A., Li, P.: Densifying one permutation hashing via rotation for fast near neighbor search. In: Proceedings of the 31st International Conference on Machine Learning, pp. 557–565 (2014)

  22. Shrivastava, A., Li, P.: Improved densification of one permutation hashing. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (2014)

  23. Shrivastava, A., Li, P.: In defense of minhash over simhash. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 886–894 (2014)

  24. Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer, J. (ed.) PSD 2014. LNCS, vol. 8744, pp. 253–268. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11257-2_20

  25. Vatsalan, D., Christen, P., O’Keefe, C.M., Verykios, V.S.: An evaluation framework for privacy-preserving record linkage. J. Privacy Confidentiality 6, 3 (2014)

  26. Winkler, W.E.: Overview of record linkage and current research directions. Technical report, U.S. Bureau of the Census Statistical Research Division (2006)


Acknowledgments

We would like to thank HRDAG for providing the data and for helpful conversations. We would also like to thank Stephen E. Fienberg and Lars Vilhuber for making this collaboration possible. Steorts’s work is supported by NSF-1652431 and NSF-1534412. Shrivastava’s work is supported by NSF-1652131 and NSF-1718478. This work is representative of the authors’ views alone and not of the funding organizations.

Author information


Corresponding author

Correspondence to Rebecca C. Steorts.


A Syrian Data Set

In this section, we provide a more detailed description of the Syrian data set. As already mentioned, via a collaboration with the Human Rights Data Analysis Group (HRDAG), we have access to four databases. They come from the Violation Documentation Centre (VDC), the Syrian Center for Statistics and Research (CSR-SY), the Syrian Network for Human Rights (SNHR), and the Syria Shuhada website (SS). Each database lists victims killed in the Syrian conflict, along with identifying information about each person (see [17] for further details).

Data collection by these organizations is carried out in a variety of ways. Three of the groups (VDC, CSR-SY, and SNHR) have trusted networks on the ground in Syria. These networks collect as much information as possible about the victims. For example, information is collected through direct community contacts. Sometimes information comes from a victim’s friends or family members. Other times, information comes from religious leaders, hospitals, or morgue records. These networks also verify information collected via social and traditional media sources. The fourth source, SS, aggregates records from multiple other sources, including NGOs and social and traditional media sources (see http://syrianshuhada.com/ for information about specific sources).

These lists, despite being products of extremely careful, systematic data collection, are not probabilistic samples [14, 15, 16, 18]. Thus, these lists cannot be assumed to represent the underlying population of all victims of conflict violence. Records collected by each source are subject to biases, stemming from a number of potential causes, including a group’s relationship within a community, resource availability, and the current security situation.

A.1 Syrian Handmatched Data Set

We now describe how HRDAG created the hand-matched training data for the Syrian data set, which we use in this paper.

First, all documented deaths recorded by any of the documentation groups were concatenated into a single list. From this list, records were broadly grouped by governorate and year. In other words, all killings recorded in Homs in 2011 were examined as a group, looking for records with similar names and dates.
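The governorate-and-year grouping described above amounts to simple key-based blocking. The following sketch uses a hypothetical flat record layout of our own invention, not HRDAG's actual schema.

```python
from collections import defaultdict

# Hypothetical record layout: (record_id, name, governorate, year).
records = [
    (1, "Ahmad Khalid", "Homs", 2011),
    (2, "Ahmed Khaled", "Homs", 2011),
    (3, "Sara Nour", "Aleppo", 2012),
]

blocks = defaultdict(list)
for record_id, name, governorate, year in records:
    blocks[(governorate, year)].append(record_id)

# All killings recorded in Homs in 2011 now sit in one block,
# ready to be scanned for similar names and dates.
```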

Next, several experts review these “blocks”, sometimes organized as pairs for comparison and other times organized as entire spreadsheets for review. These experts determine whether pairs or groups of records refer to the same individual victim or not. Pairs or groups of records determined to refer to the same individual are assigned to the same “match group.” All of the records contributing to a single “match group” are then combined into a single record. This new single record is then again examined as a pair or group with other records, in an iterative process.

For example, two records with the same name, date, and location may be identified as referring to the same individual and combined into a single record. In a second review pass, it may be found that this combined record also matches the name and location, but not the date, of a third record. The third record may list a date one week later than the two initial records, but still be determined to refer to the same individual. In this second pass, information from this third record is also incorporated into the single combined record.
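The iterative assignment of records to "match groups" is, in effect, a transitive closure over the experts' pairwise decisions, which a union-find structure captures. This is our own sketch of that bookkeeping, not HRDAG's implementation.

```python
class MatchGroups:
    """Union-find over record ids: as experts confirm pairs, the
    transitive closure of their decisions forms the match groups."""

    def __init__(self):
        self.parent = {}

    def find(self, record_id):
        self.parent.setdefault(record_id, record_id)
        while self.parent[record_id] != record_id:
            # Path halving keeps the trees shallow.
            self.parent[record_id] = self.parent[self.parent[record_id]]
            record_id = self.parent[record_id]
        return record_id

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

groups = MatchGroups()
groups.union("A", "B")  # first pass: records A and B match
groups.union("B", "C")  # second pass: the combined record also matches C
```

After both passes, records A, B, and C share one match group, exactly as in the three-record example above.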

When records are combined, the most precise information available from each of the individual records is kept. If some records contain contradictory information (for example, if records A and B report the victim’s age as 19 and record C reports age 20), the most frequently reported value is used (in this case, age 19). If each piece of contradictory information is reported by the same number of records, a value is selected at random from the contradictory set.
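The majority-vote-with-random-tie-break rule for merging a contradictory field can be sketched as follows; the function name and the fixed seed are ours, for illustration only.

```python
import random
from collections import Counter

def merge_field(values, rng=random.Random(0)):
    """Merge one field across records in a match group: keep the most
    frequently reported value, breaking ties uniformly at random."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    top = max(counts.values())
    modal = [v for v, c in counts.items() if c == top]
    return modal[0] if len(modal) == 1 else rng.choice(modal)

# Records A and B report age 19 while record C reports age 20, so 19 wins;
# a straight 19-vs-20 tie would instead be broken by a random draw.
```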

Three of the experts are native Arabic speakers; they review records with the original Arabic content. Two of the experts review records translated into English. These five experts review overlapping sets of records, meaning that some records are evaluated by two, three, four, or all five of the experts. This makes it possible to check the consistency of the reviewers, to ensure that they are each reaching comparable decisions regarding whether two (or more) records refer to the same individual or not.

After an initial round of clustering, subsets of these combined records were then re-examined to identify previously missed groups of records that refer to the same individual, particularly across years (e.g., records with dates of death 2011/12/31 and 2012/01/01 might refer to the same individual) and governorates (e.g., records with neighboring locations of death might refer to the same individual).
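The cross-boundary re-examination can be thought of as relaxing the hard year blocks into a date-proximity test. A minimal sketch, with a tolerance window and function name that are our own assumptions:

```python
from datetime import date

def plausible_same_death(d1, d2, tolerance_days=7):
    """Flag a pair for re-examination when the reported dates of death
    fall within `tolerance_days` of each other, even across a year
    boundary that strict year-blocking would split."""
    return abs((d1 - d2).days) <= tolerance_days

# 2011-12-31 and 2012-01-01 land in different year blocks yet differ by one day.
```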


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Steorts, R.C., Shrivastava, A. (2018). Probabilistic Blocking with an Application to the Syrian Conflict. In: Domingo-Ferrer, J., Montes, F. (eds.) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol. 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_21


  • DOI: https://doi.org/10.1007/978-3-319-99771-1_21


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer Science (R0)
