Abstract
Detecting product duplicates is one of the challenges that Web shop aggregators currently face. In this paper, we address the problem of product duplicate detection on the Web. Our proposed method extends a state-of-the-art solution that uses the model words in product titles to find duplicate products. First, we apply this title-based algorithm to find matching product titles. If no matching title is found, our method continues by computing similarities between the two product descriptions. These similarities are based on the product attribute keys and on the product attribute values. Furthermore, instead of extracting model words only from the title, our method also extracts model words from the product attribute values. Experimental results on real-world data gathered from two existing Web shops show that, in terms of F1-measure, the proposed method significantly outperforms the existing state-of-the-art title model words method and the well-known TF-IDF method.
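The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a model word is a token mixing letters and digits, that products are dictionaries with hypothetical `title` and `attributes` fields, that Jaccard overlap stands in for the similarity measures of the paper, and that both thresholds are arbitrary placeholders.

```python
import re


def model_words(text):
    """Return tokens that contain both letters and digits.

    Assumption: this is a simplified notion of a "model word";
    the paper's exact definition may differ.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        t for t in tokens
        if any(c.isalpha() for c in t) and any(c.isdigit() for c in t)
    }


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_duplicate(p1, p2, title_threshold=0.5, desc_threshold=0.5):
    """Two-stage duplicate check (hypothetical thresholds).

    Stage 1: compare model words found in the two titles.
    Stage 2: if the titles do not match, compare attribute keys and
             the model words extracted from the attribute values.
    """
    # Stage 1: title model words (stand-in for the title-based method).
    title_sim = jaccard(model_words(p1["title"]), model_words(p2["title"]))
    if title_sim >= title_threshold:
        return True

    # Stage 2: similarity of attribute keys ...
    key_sim = jaccard(set(p1["attributes"]), set(p2["attributes"]))

    # ... and of model words extracted from the attribute values.
    mw1 = set().union(*(model_words(v) for v in p1["attributes"].values()))
    mw2 = set().union(*(model_words(v) for v in p2["attributes"].values()))
    value_sim = jaccard(mw1, mw2)

    # Assumption: equal weighting of the two description similarities.
    return (key_sim + value_sim) / 2 >= desc_threshold
```

In this sketch, two offers whose titles share no model word can still be matched when their attribute values carry the same model identifier, which is the case the hybrid approach targets.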
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
de Bakker, M., Frasincar, F., Vandic, D. (2013). A Hybrid Model Words-Driven Approach for Web Product Duplicate Detection. In: Salinesi, C., Norrie, M.C., Pastor, Ó. (eds) Advanced Information Systems Engineering. CAiSE 2013. Lecture Notes in Computer Science, vol 7908. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38709-8_10
DOI: https://doi.org/10.1007/978-3-642-38709-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38708-1
Online ISBN: 978-3-642-38709-8
eBook Packages: Computer Science (R0)