Entity Resolution-Based Jaccard Similarity Coefficient for Heterogeneous Distributed Databases

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 379)

Abstract

Entity resolution (ER) is the task of identifying records that refer to the same real-world entity; it is also known as data object matching or deduplication. It has been a leading research topic in the field of structured databases. Because of its significance, entity resolution remains one of the most important challenges for heterogeneous distributed databases. Several methods have been proposed for entity resolution, but they have yielded unsatisfactory results. In this paper, we propose an efficient integrated solution to the entity resolution problem that combines Markov logic with the Jaccard similarity coefficient for heterogeneous distributed databases. The approach that we have implemented gives an overall success rate of about 98%, which is better than that of previously implemented algorithms.
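
For readers unfamiliar with the measure named in the abstract, the following is a minimal sketch of the standard Jaccard similarity coefficient, |A ∩ B| / |A ∪ B|, applied to two record values. It is not the authors' implementation: the abstract does not describe how Jaccard is combined with Markov logic, and the whitespace tokenization, the function names `tokenize` and `jaccard`, and the sample strings are all assumptions made for illustration.

```python
def tokenize(value: str) -> set[str]:
    """Lowercase a field value and split it on whitespace into a token set.

    Whitespace tokenization is an assumption for this sketch; the paper's
    abstract does not say how records are tokenized.
    """
    return set(value.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Return the Jaccard coefficient |A & B| / |A | B| of two token sets.

    Two empty sets are treated as identical (1.0) to avoid dividing by zero.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical record values that might describe the same person.
record_a = "John A Smith"
record_b = "Smith John"

# Shared tokens: {john, smith}; all tokens: {john, a, smith}.
score = jaccard(tokenize(record_a), tokenize(record_b))
print(f"Jaccard similarity: {score:.3f}")
```

In a deduplication setting, a score like this would typically be compared against a threshold or fed into a larger model as one feature; which of those the paper does is not stated in the abstract.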

Author information

Corresponding author

Correspondence to Ramesh Dharavath.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Dharavath, R., Singh, A.K. (2016). Entity Resolution-Based Jaccard Similarity Coefficient for Heterogeneous Distributed Databases. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 379. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2517-1_48

  • DOI: https://doi.org/10.1007/978-81-322-2517-1_48

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2516-4

  • Online ISBN: 978-81-322-2517-1

  • eBook Packages: Engineering, Engineering (R0)
