Entity Resolution-Based Jaccard Similarity Coefficient for Heterogeneous Distributed Databases

Dharavath, Ramesh; Singh, Abhishek Kumar

doi:10.1007/978-81-322-2517-1_48

Conference paper
First Online: 01 January 2015

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 379)

Abstract

Entity Resolution (ER) is the task of identifying records that refer to the same real-world entity. It is also known as data object matching or deduplication, and it has been a leading research topic in the field of structured databases. Due to its significance, entity resolution continues to be one of the most important challenges for heterogeneous distributed databases. Several methods have been proposed for entity resolution, but they have yielded unsatisfactory results. In this paper, we propose an efficient integrated solution to the entity resolution problem in heterogeneous distributed databases that combines Markov logic with the Jaccard similarity coefficient. Our implementation achieves an overall success rate of about 98%, which is better than the previously implemented algorithms.
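
For two sets A and B, the Jaccard similarity coefficient is |A ∩ B| / |A ∪ B|. The Python sketch below illustrates how such a coefficient could score a pair of records as a candidate match. It is only a sketch under stated assumptions: the tokenization (lowercased, whitespace-separated words), the function names, the sample records, and the 0.5 threshold are hypothetical and are not taken from the paper, and the Markov logic component of the proposed approach is not modeled here.

```python
from __future__ import annotations


def tokenize(record: str) -> set[str]:
    """Split a record into lowercased whitespace-separated tokens.

    This tokenization is an assumption made for illustration only.
    """
    return set(record.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_match(r1: str, r2: str, threshold: float = 0.5) -> bool:
    """Flag two records as a candidate match when their Jaccard score
    reaches the threshold (0.5 is an arbitrary illustrative value)."""
    return jaccard(tokenize(r1), tokenize(r2)) >= threshold


if __name__ == "__main__":
    # Hypothetical records describing the same entity in two databases.
    r1 = "Ramesh Dharavath Indian School of Mines Dhanbad"
    r2 = "R. Dharavath Indian School of Mines"
    print(jaccard(tokenize(r1), tokenize(r2)))  # 5 shared tokens out of 8 distinct
    print(is_match(r1, r2))
```

Token sets keep the sketch short. A practical system would probably normalize punctuation and abbreviations before comparison and would tune the threshold on labeled data.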

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, 826004, India
Ramesh Dharavath & Abhishek Kumar Singh

Corresponding author

Correspondence to Ramesh Dharavath.

Editor information

Editors and Affiliations

Dept. of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, India
Suresh Chandra Satapathy
Department of CSE, CMR Technical Campus, Hyderabad, India
K. Srujan Raju
Computer Science & Engineering, Kalyani University, Nadia, West Bengal, India
Jyotsna Kumar Mandal
Electronics and Communication Engineering, Shri Ramswaroop Memorial Group of Professional Colleges, Lucknow, Uttar Pradesh, India
Vikrant Bhateja

About this paper

Cite this paper

Dharavath, R., Singh, A.K. (2016). Entity Resolution-Based Jaccard Similarity Coefficient for Heterogeneous Distributed Databases. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 379. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2517-1_48

DOI: https://doi.org/10.1007/978-81-322-2517-1_48
Published: 05 September 2015
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2516-4
Online ISBN: 978-81-322-2517-1
eBook Packages: Engineering, Engineering (R0)