Abstract
In the Semantic Web community, many approaches have been developed for generating RDF (Resource Description Framework) resources. However, they often capture duplicate resources, which are then stored without elimination. As a consequence, duplicate resources reduce data quality and unnecessarily increase the size of the dataset. We propose an approach for detecting duplicate resources in RDF datasets using Hadoop and the MapReduce framework. RDF resources are compared using similarity metrics defined at the resource level, the RDF statement level, and the object level. The approach is assessed with evaluation metrics, and the experimental evaluation showed its accuracy, effectiveness, and efficiency.
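To make the idea of statement-level comparison concrete, the following is a minimal single-process sketch in Python, not the authors' implementation. It assumes, purely for illustration, that two resources are compared by the Jaccard coefficient of their (predicate, object) sets; the abstract does not name the actual metrics. The function names, the sample triples, and the 0.6 threshold are all hypothetical, and the map and reduce steps are plain functions rather than Hadoop jobs.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets; treated as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def map_phase(triples):
    """Map step: emit (subject, (predicate, object)) for every RDF statement."""
    for s, p, o in triples:
        yield s, (p, o)


def reduce_phase(pairs):
    """Reduce step: group the statements of each subject into one set."""
    grouped = defaultdict(set)
    for subject, statement in pairs:
        grouped[subject].add(statement)
    return grouped


def find_duplicates(triples, threshold=0.6):
    """Return resource pairs whose statement sets reach the threshold.

    The threshold value is an assumption chosen for this example only.
    """
    resources = reduce_phase(map_phase(triples))
    return [
        (r1, r2)
        for r1, r2 in combinations(sorted(resources), 2)
        if jaccard(resources[r1], resources[r2]) >= threshold
    ]


# Hypothetical sample data: two descriptions of the same person and one other.
triples = [
    ("ex:alice1", "foaf:name", "Alice Smith"),
    ("ex:alice1", "foaf:mbox", "alice@example.org"),
    ("ex:alice1", "foaf:age", "30"),
    ("ex:alice2", "foaf:name", "Alice Smith"),
    ("ex:alice2", "foaf:mbox", "alice@example.org"),
    ("ex:bob", "foaf:name", "Bob Jones"),
]
duplicates = find_duplicates(triples)
```

In this sketch every pair of resources is compared, which is quadratic in the number of resources. A distributed version would likely need some form of blocking or partitioning before comparison, but that choice is not described in the abstract.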
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sharma, K., Marjit, U., Biswas, U. (2018). Duplicate Resource Detection in RDF Datasets Using Hadoop and MapReduce. In: Kalam, A., Das, S., Sharma, K. (eds) Advances in Electronics, Communication and Computing. Lecture Notes in Electrical Engineering, vol 443. Springer, Singapore. https://doi.org/10.1007/978-981-10-4765-7_26
DOI: https://doi.org/10.1007/978-981-10-4765-7_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-4764-0
Online ISBN: 978-981-10-4765-7
eBook Packages: Engineering, Engineering (R0)