Abstract
Entity resolution seeks to merge databases so as to remove duplicate entries when unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which draws on the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce to the entity resolution literature a subquadratic variant of LSH known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, with a discussion of each method.
Notes
- 1.
In this paper, we utilize a shingling-based approach, and thus our representation of each record is likely to be very sparse. Moreover, [23] showed that minhashing-based approaches are superior to random-projection-based approaches for very sparse data sets.
- 2.
The assumption holds when dealing with floating point numbers for small enough \(\delta \).
- 3.
Note that the precision of a blocking procedure is not expected to be high, since blocking only places similar pairs in the same block; it does not fully run an entity resolution or de-duplication procedure, which would try to maximize both recall and precision.
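As note 1 above describes, each record is represented by a sparse set of shingles, over which minhash signatures approximate Jaccard similarity; records whose signatures collide can then be placed in the same block. The following is a minimal sketch of classical minhashing only (DOPH replaces the many independent hash passes below with a single permutation plus densification); the record strings and the choice of character 3-shingles are illustrative:

```python
import hashlib

def shingles(text, k=3):
    """Character k-shingles of a record string; a very sparse set representation."""
    s = text.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """Classical minhash: for each salted hash function, keep the minimum value."""
    return [
        min(int(hashlib.md5(f"{seed}:{sh}".encode()).hexdigest(), 16)
            for sh in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing minhashes estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two near-duplicate (hypothetical) records should agree on most minhashes.
a = minhash_signature(shingles("Mohammed al-Halabi 2013-05-01 Homs"))
b = minhash_signature(shingles("Mohamed al-Halabi 2013-05-01 Homs"))
print(estimated_jaccard(a, b))
```

In a full LSH blocking pipeline, the signature would be cut into bands, and records sharing any band would land in the same block.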
References
Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of 1997 Compression and Complexity of Sequences. IEEE, pp. 21–29 (1997)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Eidenbenz, S., Triguero, F., Morales, R., Conejo, R., Hennessy, M. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 693–703. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45465-9_59
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 1537–1555 (2012)
Christen, P., Gayler, R., Hawking, D.: Similarity-aware indexing for real-time entity resolution. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1565–1568 (2009)
Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. In: Very Large Data Bases (VLDB), vol. 99, pp. 518–529 (1999)
Gollapudi, S., Panigrahy, R.: Exploiting asymmetry in hierarchical topic extraction. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management. ACM, pp. 475–482 (2006)
Haeupler, B., Manasse, M., Talwar, K.: Consistent weighted sampling made fast, small, and easy. Technical report (2014). arXiv:1410.4266
Herzog, T., Scheuren, F., Winkler, W.: Data Quality and Record Linkage Techniques. Springer, New York (2007). https://doi.org/10.1007/0-387-69505-2
Herzog, T., Scheuren, F., Winkler, W.: Record linkage. Wiley Interdisc. Rev.: Comput. Stat. 2 (2010). https://doi.org/10.1002/wics.108
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, Dallas, TX, pp. 604–613 (1998)
Ioffe, S.: Improved consistent sampling, weighted minhash and L1 sketching. In: ICDM, Sydney, AU, pp. 246–255 (2010)
McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM (2000)
Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recogn. Lett. 31, 1348–1358 (2010)
Price, M., Ball, P.: The limits of observation for understanding mass violence. Can. J. Law Soc./Revue Canadienne Droit et Société 30, 237–257 (2015a)
Price, M., Ball, P.: Selection bias and the statistical patterns of mortality in conflict. Stat. J. IAOS 31, 263–272 (2015b)
Price, M., Gohdes, A., Ball, P.: Documents of war: understanding the Syrian conflict. Significance 12, 14–19 (2015)
Price, M., Klingner, J., Qtiesh, A., Ball, P.: Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the UN High Commissioner for Human Rights (2013)
Price, M., Klingner, J., Qtiesh, A., Ball, P.: Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the UN High Commissioner for Human Rights (2014)
Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)
Sadosky, P., Shrivastava, A., Price, M., Steorts, R.C.: Blocking methods applied to casualty records from the Syrian conflict (2015). arXiv preprint arXiv:1510.07714
Shrivastava, A., Li, P.: Densifying one permutation hashing via rotation for fast near neighbor search. In: Proceedings of The 31st International Conference on Machine Learning, pp. 557–565 (2014)
Shrivastava, A., Li, P.: Improved densification of one permutation hashing. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (2014)
Shrivastava, A., Li, P.: In defense of minhash over simhash. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 886–894 (2014)
Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer, J. (ed.) PSD 2014. LNCS, vol. 8744, pp. 253–268. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11257-2_20
Vatsalan, D., Christen, P., O’Keefe, C.M., Verykios, V.S.: An evaluation framework for privacy-preserving record linkage. J. Privacy Confidentiality 6, 3 (2014)
Winkler, W.E.: Overview of record linkage and current research directions. Technical report, U.S. Bureau of the Census Statistical Research Division (2006)
Acknowledgments
We would like to thank HRDAG for providing the data and for helpful conversations. We would also like to thank Stephen E. Fienberg and Lars Vilhuber for making this collaboration possible. Steorts's work is supported by NSF-1652431 and NSF-1534412. Shrivastava's work is supported by NSF-1652131 and NSF-1718478. The views expressed are those of the authors alone and not of the funding organizations.
A Syrian Data Set
In this section, we provide a more detailed description of the Syrian data set. As already mentioned, via collaboration with the Human Rights Data Analysis Group (HRDAG), we have access to four databases. They come from the Violation Documentation Centre (VDC), the Syrian Center for Statistics and Research (CSR-SY), the Syrian Network for Human Rights (SNHR), and the Syria Shuhada website (SS). Each database lists each victim killed in the Syrian conflict, along with identifying information about each person (see [17] for further details).
Data collection by these organizations is carried out in a variety of ways. Three of the groups (VDC, CSR-SY, and SNHR) have trusted networks on the ground in Syria. These networks collect as much information as possible about the victims. For example, information is collected through direct community contacts. Sometimes information comes from a victim’s friends or family members. Other times, information comes from religious leaders, hospitals, or morgue records. These networks also verify information collected via social and traditional media sources. The fourth source, SS, aggregates records from multiple other sources, including NGOs and social and traditional media sources (see http://syrianshuhada.com/ for information about specific sources).
These lists, despite being products of extremely careful, systematic data collection, are not probabilistic samples [14,15,16, 18]. Thus, these lists cannot be assumed to represent the underlying population of all victims of conflict violence. Records collected by each source are subject to biases, stemming from a number of potential causes, including a group’s relationship within a community, resource availability, and the current security situation.
A.1 Syrian Handmatched Data Set
We describe how HRDAG created the hand-matched training data for the Syrian data set that we use in this paper.
First, all documented deaths recorded by any of the documentation groups were concatenated together into a single list. From this list, records were broadly grouped according to governorate and year. In other words, all killings recorded in Homs in 2011 were examined as a group, looking for records with similar names and dates.
Next, several experts review these “blocks”, sometimes organized as pairs for comparison and other times organized as entire spreadsheets for review. These experts determine whether pairs or groups of records refer to the same individual victim or not. Pairs or groups of records determined to refer to the same individual are assigned to the same “match group.” All of the records contributing to a single “match group” are then combined into a single record. This new single record is then again examined as a pair or group with other records, in an iterative process.
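Computationally, this iterative assignment of records to "match groups" amounts to taking connected components over expert-confirmed links: any chain of matched pairs ends up in one group. A minimal union-find sketch of that bookkeeping (illustrative only, with hypothetical record IDs; this is not HRDAG's actual tooling):

```python
def make_match_groups(record_ids, matched_pairs):
    """Union-find: records linked by any chain of confirmed matches share a group."""
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

# Records 1 and 2 match, and later 2 and 3 match, so all three merge into one group.
print(make_match_groups([1, 2, 3, 4], [(1, 2), (2, 3)]))
```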
For example, two records with the same name, date, and location may be identified as referring to the same individual and combined into a single record. In a second review pass, this combined record may be found to match the name and location, but not the date, of a third record. The third record may list a date one week later than the two initial records but still be determined to refer to the same individual. In this second pass, information from the third record will also be included in the single combined record.
When records are combined, the most precise information available from each of the individual records is kept. If some records contain contradictory information (for example, if records A and B record the victim as age 19 and record C records age 20) the most frequently reported information is used (in this case, age 19). If the same number of records report each piece of contradictory information, a value from the contradictory set is randomly selected.
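A hedged sketch of this combination rule for a single field (illustrative only; the seeded random generator stands in for the random tie-break described above, and `None` marks a missing value):

```python
import random
from collections import Counter

def combine_field(values, rng=random.Random(0)):
    """Majority vote across records' values for one field; ties broken at random."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None  # no record reports this field
    top = max(counts.values())
    tied = [v for v, c in counts.items() if c == top]
    return rng.choice(tied)

# Records A and B report age 19, record C reports age 20: majority keeps 19.
print(combine_field([19, 19, 20]))
```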
Three of the experts are native Arabic speakers; they review records with the original Arabic content. Two of the experts review records translated into English. These five experts review overlapping sets of records, meaning that some records are evaluated by two, three, four, or all five of the experts. This makes it possible to check the consistency of the reviewers, to ensure that they are each reaching comparable decisions regarding whether two (or more) records refer to the same individual or not.
After an initial round of clustering, subsets of these combined records were then re-examined to identify previously missed groups of records that refer to the same individual, particularly across years (e.g., records with dates of death 2011/12/31 and 2012/01/01 might refer to the same individual) and governorates (e.g., records with neighboring locations of death might refer to the same individual).
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Steorts, R.C., Shrivastava, A. (2018). Probabilistic Blocking with an Application to the Syrian Conflict. In: Domingo-Ferrer, J., Montes, F. (eds.) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol. 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_21
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1