Decision Tree-Based Anonymized Electronic Health Record Fusion for Public Health Informatics

  • Fatima Khalique
  • Shoab Ahmed Khan
  • Qurat-ul-ain Mubarak
  • Hasan Safdar
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 858)


Electronic Health Records (EHRs) are frequently used in Health Information Exchanges to fuse data belonging to the same patients, matched through their demographic attributes, for public health informatics. Fusing this information across multiple health care entities presents a two-fold complexity. First, privacy constraints on sharing demographic information across organizations are stringent, which requires encrypting or hashing records for anonymity. Second, fusing anonymized data leads to the problem of finding duplicate records and accurately linking incoming information to existing records. This paper presents a methodology that allows the office of any public health department to acquire health data while preserving the privacy, integrity and usefulness of the data. Our novel duplicate detection algorithm combines cryptographic hashing and machine learning techniques to link patients' records approximately by identifying duplicate and unique records. Experimental results on three different datasets show that the proposed methodology can effectively detect duplicates based on encoded demographic data from EHRs. In addition, the proposed methodology can potentially be applied to record matching in other domains with encoded data.
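The general idea of linking records on encoded demographics can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are not given here. It assumes each demographic field is hashed separately with a keyed SHA-256 digest, that pairs of encoded records are compared field by field, and that a few hand-written rules stand in for a trained decision tree. The field names, the shared key and the rule thresholds are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared key; the exchanging parties would have to agree on it securely.
SECRET_KEY = b"demo-key-not-for-production"

# Hypothetical demographic fields used for linking.
FIELDS = ("first_name", "last_name", "dob", "gender")


def normalize(value):
    """Lowercase and remove whitespace so trivially different spellings hash alike."""
    return "".join(value.lower().split())


def encode_record(record):
    """Replace each demographic field with a keyed SHA-256 digest (HMAC)."""
    return {
        field: hmac.new(
            SECRET_KEY,
            normalize(record[field]).encode("utf-8"),
            hashlib.sha256,
        ).hexdigest()
        for field in FIELDS
    }


def agreement_vector(enc_a, enc_b):
    """Return 1 for each field whose digests match and 0 otherwise."""
    return tuple(int(hmac.compare_digest(enc_a[f], enc_b[f])) for f in FIELDS)


def classify(vector):
    """Hand-written rules standing in for a trained decision tree (assumption)."""
    first, last, dob, gender = vector
    if dob and last:
        return "duplicate" if (first or gender) else "possible duplicate"
    if dob and first and gender:
        return "possible duplicate"
    return "unique"


if __name__ == "__main__":
    incoming = {"first_name": "Amina ", "last_name": "Khan",
                "dob": "1985-03-12", "gender": "F"}
    existing = {"first_name": "amina", "last_name": "KHAN",
                "dob": "1985-03-12", "gender": "f"}
    vector = agreement_vector(encode_record(incoming), encode_record(existing))
    print(vector, classify(vector))
```

Hashing each field separately is a deliberate choice in this sketch: a cryptographic digest changes completely when a single character changes, so a hash of the whole record would only ever support exact matching, whereas per-field digests still reveal which attributes agree.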


Electronic Health Record (EHR), Demographic anonymization, Duplicate detection, Patient record linking, Health data exchange, Health data privacy, Decision tree, Hashing



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Fatima Khalique (1), corresponding author
  • Shoab Ahmed Khan (2)
  • Qurat-ul-ain Mubarak (2)
  • Hasan Safdar (3)
  1. National University of Sciences and Technology, Islamabad, Pakistan
  2. College of Electrical and Mechanical Engineering, NUST, Islamabad, Pakistan
  3. Center for Advanced Studies in Engineering, Islamabad, Pakistan
