Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce

  • Conference paper
Advances in Electronics, Communication and Computing

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 443))

Abstract

In the Semantic Web community, many approaches have evolved for generating RDF (Resource Description Framework) resources. However, they often capture duplicate resources, which are then stored without elimination. As a consequence, duplicate resources reduce data quality and unnecessarily increase the size of the dataset. We propose an approach for detecting duplicate resources in RDF datasets using the Hadoop and MapReduce framework. RDF resources are compared using similarity metrics defined at the resource level, the RDF statement level, and the object level. Performance is measured with evaluation metrics, and the experimental evaluation showed the accuracy, effectiveness, and efficiency of the proposed approach.
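The abstract does not spell out the similarity metrics or the job layout, so the following is only a minimal sketch of how duplicate detection over RDF triples could be arranged in a map/reduce style. It runs in plain Python rather than on Hadoop, and it makes several assumptions that are not taken from the paper: Jaccard similarity over each resource's set of (predicate, object) statements stands in for the resource-level metric, statements are matched exactly, the threshold of 0.8 is arbitrary, and the function names and `ex:` identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_triples(triples):
    """Map phase: emit (subject, (predicate, object)) for every triple."""
    for subject, predicate, obj in triples:
        yield subject, (predicate, obj)


def reduce_by_subject(pairs):
    """Reduce phase: gather each resource's statements into one set."""
    resources = defaultdict(set)
    for subject, statement in pairs:
        resources[subject].add(statement)
    return dict(resources)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed metric, not the paper's)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(resources, threshold=0.8):
    """Compare every pair of resources and keep pairs at or above threshold."""
    duplicates = []
    for s1, s2 in combinations(sorted(resources), 2):
        score = jaccard(resources[s1], resources[s2])
        if score >= threshold:
            duplicates.append((s1, s2, score))
    return duplicates


# Hypothetical sample data: two descriptions of the same person and one other.
triples = [
    ("ex:alice1", "foaf:name", "Alice"),
    ("ex:alice1", "foaf:mbox", "alice@example.org"),
    ("ex:alice2", "foaf:name", "Alice"),
    ("ex:alice2", "foaf:mbox", "alice@example.org"),
    ("ex:bob", "foaf:name", "Bob"),
]

resources = reduce_by_subject(map_triples(triples))
duplicates = find_duplicates(resources)
print(duplicates)
```

On a real cluster the all-pairs comparison would itself need to be partitioned, for example by a blocking key, since comparing every resource with every other grows quadratically with the dataset size.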

Author information

Corresponding author

Correspondence to Kumar Sharma.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Sharma, K., Marjit, U., Biswas, U. (2018). Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce. In: Kalam, A., Das, S., Sharma, K. (eds) Advances in Electronics, Communication and Computing. Lecture Notes in Electrical Engineering, vol 443. Springer, Singapore. https://doi.org/10.1007/978-981-10-4765-7_26

  • DOI: https://doi.org/10.1007/978-981-10-4765-7_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-4764-0

  • Online ISBN: 978-981-10-4765-7

  • eBook Packages: Engineering, Engineering (R0)
