Abstract
Entity resolution seeks to merge databases so as to remove duplicate entries when unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which draws on the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce to the entity resolution literature a subquadratic variant of LSH known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, with a discussion of each method.
Notes
- 1.
In this paper, we utilize a shingling-based approach, and thus our representation of each record is likely to be very sparse. Moreover, [23] showed that minhashing-based approaches are superior to random-projection-based approaches for very sparse data sets.
- 2.
The assumption holds when dealing with floating point numbers for small enough \(\delta \).
- 3.
Note that the precision of a blocking procedure is not expected to be high, since blocking only places similar pairs in the same block; it does not fully run an entity resolution or de-duplication procedure, which would try to maximize both recall and precision.
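As note 1 above describes, each record is represented by a sparse set of shingles, over which minhash signatures approximate Jaccard similarity; records whose signatures collide can then be placed in the same block. The following is a minimal sketch of classical minhashing only (DOPH replaces the many independent hash passes below with a single permutation plus densification); the record strings and the choice of character 3-shingles are illustrative:

```python
import hashlib

def shingles(text, k=3):
    """Character k-shingles of a record string; a very sparse set representation."""
    s = text.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """Classical minhash: for each salted hash function, keep the minimum value."""
    return [
        min(int(hashlib.md5(f"{seed}:{sh}".encode()).hexdigest(), 16)
            for sh in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing minhashes estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two near-duplicate (hypothetical) records should agree on most minhashes.
a = minhash_signature(shingles("Mohammed al-Halabi 2013-05-01 Homs"))
b = minhash_signature(shingles("Mohamed al-Halabi 2013-05-01 Homs"))
print(estimated_jaccard(a, b))
```

In a full LSH blocking pipeline, the signature would be cut into bands, and records sharing any band would land in the same block.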
References
Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of 1997 Compression and Complexity of Sequences. IEEE, pp. 21–29 (1997)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Eidenbenz, S., Triguero, F., Morales, R., Conejo, R., Hennessy, M. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 693–703. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45465-9_59
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24, 1537–1555 (2012)
Christen, P., Gayler, R., Hawking, D.: Similarity-aware indexing for real-time entity resolution. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1565–1568 (2009)
Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. In: Very Large Data Bases (VLDB), vol. 99, pp. 518–529 (1999)
Gollapudi, S., Panigrahy, R.: Exploiting asymmetry in hierarchical topic extraction. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management. ACM, pp. 475–482 (2006)
Haeupler, B., Manasse, M., Talwar, K.: Consistent weighted sampling made fast, small, and easy. Technical report (2014). arXiv:1410.4266
Herzog, T., Scheuren, F., Winkler, W.: Data Quality and Record Linkage Techniques. Springer, New York (2007). https://doi.org/10.1007/0-387-69505-2
Herzog, T., Scheuren, F., Winkler, W.: Record linkage. Wiley Interdisc. Rev.: Comput. Stat. 2 (2010). https://doi.org/10.1002/wics.108
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, Dallas, TX, pp. 604–613 (1998)
Ioffe, S.: Improved consistent sampling, weighted minhash and L1 sketching. In: ICDM, Sydney, AU, pp. 246–255 (2010)
McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM (2000)
Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recogn. Lett. 31, 1348–1358 (2010)
Price, M., Ball, P.: The limits of observation for understanding mass violence. Can. J. Law Soc./Revue Canadienne Droit et Société 30, 237–257 (2015a)
Price, M., Ball, P.: Selection bias and the statistical patterns of mortality in conflict. Stat. J. IAOS 31, 263–272 (2015b)
Price, M., Gohdes, A., Ball, P.: Documents of war: understanding the Syrian conflict. Significance 12, 14–19 (2015)
Price, M., Klingner, J., Qtiesh, A., Ball, P.: Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the UN High Commissioner for Human Rights (2013)
Price, M., Klingner, J., Qtiesh, A., Ball, P.: Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the UN High Commissioner for Human Rights (2014)
Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)
Sadosky, P., Shrivastava, A., Price, M., Steorts, R.C.: Blocking methods applied to casualty records from the Syrian conflict (2015). arXiv preprint arXiv:1510.07714
Shrivastava, A., Li, P.: Densifying one permutation hashing via rotation for fast near neighbor search. In: Proceedings of The 31st International Conference on Machine Learning, pp. 557–565 (2014)
Shrivastava, A., Li, P.: Improved densification of one permutation hashing. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (2014)
Shrivastava, A., Li, P.: In defense of minhash over simhash. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 886–894 (2014)
Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer, J. (ed.) PSD 2014. LNCS, vol. 8744, pp. 253–268. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11257-2_20
Vatsalan, D., Christen, P., O’Keefe, C.M., Verykios, V.S.: An evaluation framework for privacy-preserving record linkage. J. Privacy Confidentiality 6, 3 (2014)
Winkler, W.E.: Overview of record linkage and current research directions. Technical report, U.S. Bureau of the Census Statistical Research Division (2006)
Acknowledgments
We would like to thank HRDAG for providing the data and for helpful conversations. We would also like to thank Stephen E. Fienberg and Lars Vilhuber for making this collaboration possible. Steorts's work is supported by NSF-1652431 and NSF-1534412. Shrivastava's work is supported by NSF-1652131 and NSF-1718478. The views expressed are those of the authors alone and not of the funding organizations.
A Syrian Data Set
In this section, we provide a more detailed description of the Syrian data set. As already mentioned, via collaboration with the Human Rights Data Analysis Group (HRDAG), we have access to four databases. They come from the Violation Documentation Centre (VDC), the Syrian Center for Statistics and Research (CSR-SY), the Syrian Network for Human Rights (SNHR), and the Syria Shuhada website (SS). Each database lists each victim killed in the Syrian conflict, along with identifying information about each person (see [17] for further details).
Data collection by these organizations is carried out in a variety of ways. Three of the groups (VDC, CSR-SY, and SNHR) have trusted networks on the ground in Syria. These networks collect as much information as possible about the victims. For example, information is collected through direct community contacts. Sometimes information comes from a victim’s friends or family members. Other times, information comes from religious leaders, hospitals, or morgue records. These networks also verify information collected via social and traditional media sources. The fourth source, SS, aggregates records from multiple other sources, including NGOs and social and traditional media sources (see http://syrianshuhada.com/ for information about specific sources).
These lists, despite being products of extremely careful, systematic data collection, are not probabilistic samples [14,15,16, 18]. Thus, these lists cannot be assumed to represent the underlying population of all victims of conflict violence. Records collected by each source are subject to biases, stemming from a number of potential causes, including a group’s relationship within a community, resource availability, and the current security situation.
A.1 Syrian Handmatched Data Set
We describe how HRDAG created the hand-matched training data for the Syrian data set that we use in this paper.
First, all documented deaths recorded by any of the documentation groups were concatenated together into a single list. From this list, records were broadly grouped according to governorate and year. In other words, all killings recorded in Homs in 2011 were examined as a group, looking for records with similar names and dates.
Next, several experts review these “blocks”, sometimes organized as pairs for comparison and other times organized as entire spreadsheets for review. These experts determine whether pairs or groups of records refer to the same individual victim or not. Pairs or groups of records determined to refer to the same individual are assigned to the same “match group.” All of the records contributing to a single “match group” are then combined into a single record. This new single record is then again examined as a pair or group with other records, in an iterative process.
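Computationally, this iterative assignment of records to "match groups" amounts to taking connected components over expert-confirmed links: any chain of matched pairs ends up in one group. A minimal union-find sketch of that bookkeeping (illustrative only, with hypothetical record IDs; this is not HRDAG's actual tooling):

```python
def make_match_groups(record_ids, matched_pairs):
    """Union-find: records linked by any chain of confirmed matches share a group."""
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

# Records 1 and 2 match, and later 2 and 3 match, so all three merge into one group.
print(make_match_groups([1, 2, 3, 4], [(1, 2), (2, 3)]))
```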
For example, two records with the same name, date, and location may be identified as referring to the same individual and combined into a single record. In a second review pass, this combined record may be found to match the name and location, but not the date, of a third record. The third record may list a date one week later than the two initial records but still be determined to refer to the same individual. In this second pass, information from the third record will also be included in the single combined record.
When records are combined, the most precise information available from each of the individual records is kept. If some records contain contradictory information (for example, if records A and B record the victim as age 19 and record C records age 20) the most frequently reported information is used (in this case, age 19). If the same number of records report each piece of contradictory information, a value from the contradictory set is randomly selected.
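A hedged sketch of this combination rule for a single field (illustrative only; the seeded random generator stands in for the random tie-break described above, and `None` marks a missing value):

```python
import random
from collections import Counter

def combine_field(values, rng=random.Random(0)):
    """Majority vote across records' values for one field; ties broken at random."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None  # no record reports this field
    top = max(counts.values())
    tied = [v for v, c in counts.items() if c == top]
    return rng.choice(tied)

# Records A and B report age 19, record C reports age 20: majority keeps 19.
print(combine_field([19, 19, 20]))
```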
Three of the experts are native Arabic speakers; they review records with the original Arabic content. Two of the experts review records translated into English. These five experts review overlapping sets of records, meaning that some records are evaluated by two, three, four, or all five of the experts. This makes it possible to check the consistency of the reviewers, to ensure that they are each reaching comparable decisions regarding whether two (or more) records refer to the same individual or not.
After an initial round of clustering, subsets of these combined records were then re-examined to identify previously missed groups of records that refer to the same individual, particularly across years (e.g., records with dates of death 2011/12/31 and 2012/01/01 might refer to the same individual) and governorates (e.g., records with neighboring locations of death might refer to the same individual).
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Steorts, R.C., Shrivastava, A. (2018). Probabilistic Blocking with an Application to the Syrian Conflict. In: Domingo-Ferrer, J., Montes, F. (eds.) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol. 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_21
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1