Probabilistic Blocking with an Application to the Syrian Conflict

  • Rebecca C. Steorts
  • Anshumali Shrivastava
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)


Entity resolution seeks to merge databases so as to remove duplicate entries when unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict and discuss each method.
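As a rough illustration of what LSH-based blocking means in practice, the following is a minimal sketch of plain minhash with banding over character bigrams. It is not KLSH, DOPH, or the weighted variant described in the abstract, and it is not the authors' implementation. The function names, the bigram shingling, and the band and row counts are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    if len(s) < k:
        return {s}  # too short to shingle: treat the whole string as one token
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes):
    """One minimum per seeded hash function (plain minhash, not DOPH)."""
    return tuple(
        min(seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)
    )


def lsh_blocks(records, bands=4, rows=2):
    """Group record ids that collide in at least one band of their signature.

    `records` maps a record id to a string; `bands` and `rows` are
    illustrative settings, not values taken from the paper.
    """
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets[key].add(rec_id)
    # Keep only buckets holding candidate duplicates, without repeats.
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical toy records, not drawn from the Syrian conflict data.
    toy = {1: "Ahmad Khalil Homs", 2: "Ahmad Khalil Homs", 3: "zzzz qqqq"}
    print(lsh_blocks(toy))
```

Only records that share a bucket are compared in a later matching step, which is how blocking avoids the all-pairs comparison. Computing the signature above still takes one pass over the tokens per hash function; one permutation hashing schemes such as DOPH aim to reduce that cost.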



We would like to thank HRDAG for providing the data and for helpful conversations. We would also like to thank Stephen E. Fienberg and Lars Vilhuber for making this collaboration possible. Steorts’s work is supported by NSF-1652431 and NSF-1534412. Shrivastava’s work is supported by NSF-1652131 and NSF-1718478. This work represents the views of the authors alone and not those of the funding organizations.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Statistical Science, Affiliated Faculty, Computer Science, Biostatistics and Bioinformatics, the Information Initiative at Duke (iiD), and the Social Science Research Institute (SSRI), Duke University, Durham, USA
  2. Department of Computer Science, Rice University, Houston, USA
