Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce

Sharma, Kumar; Marjit, Ujjal; Biswas, Utpal

doi:10.1007/978-981-10-4765-7_26

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 443)

Abstract

In the Semantic Web community, many approaches have evolved for generating RDF (Resource Description Framework) resources. However, they often capture duplicate resources, which are then stored without being eliminated. As a consequence, duplicate resources reduce data quality and needlessly increase the size of the dataset. We propose an approach for detecting duplicate resources in RDF datasets using Hadoop and the MapReduce framework. RDF resources are compared using similarity metrics defined at the resource level, the RDF statement level, and the object level. The approach is assessed with evaluation metrics, and the experimental evaluation shows its accuracy, effectiveness, and efficiency.
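The three comparison levels named in the abstract can be illustrated with a small in-memory sketch. It is not the authors' implementation: the abstract does not define the metrics, so token-level Jaccard similarity for objects, best-match scoring per predicate for statements, and a plain average for resources are all assumptions made here. The sample triples, the function names, and the 0.8 threshold are likewise hypothetical, and the map and shuffle steps are simulated in plain Python rather than run on Hadoop.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy triples (subject, predicate, object); not data from the paper.
TRIPLES = [
    ("ex:r1", "foaf:name", "Kumar Sharma"),
    ("ex:r1", "ex:city", "Kalyani"),
    ("ex:r2", "foaf:name", "Kumar Sharma"),
    ("ex:r2", "ex:city", "Kalyani"),
    ("ex:r3", "foaf:name", "Akhtar Kalam"),
    ("ex:r3", "ex:city", "Melbourne"),
]


def object_similarity(obj_a, obj_b):
    """Object level (assumed metric): Jaccard similarity over lowercase tokens."""
    tokens_a, tokens_b = set(obj_a.lower().split()), set(obj_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def statement_similarity(predicate, stmts_a, stmts_b):
    """Statement level (assumed): best object match among statements sharing a predicate."""
    objs_a = [o for p, o in stmts_a if p == predicate]
    objs_b = [o for p, o in stmts_b if p == predicate]
    return max(
        (object_similarity(a, b) for a in objs_a for b in objs_b),
        default=0.0,  # predicate missing on one side
    )


def resource_similarity(stmts_a, stmts_b):
    """Resource level (assumed): mean statement similarity over all predicates seen."""
    predicates = {p for p, _ in stmts_a} | {p for p, _ in stmts_b}
    if not predicates:
        return 0.0
    total = sum(statement_similarity(p, stmts_a, stmts_b) for p in predicates)
    return total / len(predicates)


def map_phase(triples):
    """Map: emit (subject, (predicate, object)) so statements group by resource."""
    for subject, predicate, obj in triples:
        yield subject, (predicate, obj)


def shuffle(pairs):
    """Simulated shuffle: collect all values emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def find_duplicates(triples, threshold=0.8):
    """Reduce-style step: compare every pair of resources and keep likely duplicates."""
    groups = shuffle(map_phase(triples))
    return [
        (a, b)
        for a, b in combinations(sorted(groups), 2)
        if resource_similarity(groups[a], groups[b]) >= threshold
    ]


if __name__ == "__main__":
    print(find_duplicates(TRIPLES))
```

Comparing every pair of resources is quadratic in the number of resources; a real MapReduce job would more likely emit a blocking key in the map step so that only resources sharing a key meet in the same reducer. That choice is also an assumption here, since the abstract does not say how candidate pairs are formed.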

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India
Kumar Sharma & Utpal Biswas
Centre for Information Resource Management (CIRM), University of Kalyani, Kalyani, West Bengal, India
Ujjal Marjit

Corresponding author

Correspondence to Kumar Sharma.

Editor information

Editors and Affiliations

Smart Energy Research Unit, College of Engineering and Science, Victoria University, Melbourne, VIC, Australia
Akhtar Kalam
Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Swagatam Das
Department of Computer Science and Engineering, Sikkim Manipal Institute of Technology, Rangpo, Sikkim, India
Kalpana Sharma

About this paper

Cite this paper

Sharma, K., Marjit, U., Biswas, U. (2018). Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce. In: Kalam, A., Das, S., Sharma, K. (eds) Advances in Electronics, Communication and Computing. Lecture Notes in Electrical Engineering, vol 443. Springer, Singapore. https://doi.org/10.1007/978-981-10-4765-7_26

DOI: https://doi.org/10.1007/978-981-10-4765-7_26
Published: 29 October 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-4764-0
Online ISBN: 978-981-10-4765-7
eBook Packages: Engineering, Engineering (R0)
