A Data Type-Driven Property Alignment Framework for Product Duplicate Detection on the Web

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10041)

Abstract

During the last decade, daily life has morphed into a world of broadband ubiquity, where devices facilitate constant engagement. As a consequence, e-commerce has seen immense growth. Despite the market opportunities for retailers and the ease with which customers can acquire products through webshops, the shift to digital retail has its drawbacks. For example, it leads to cluttered and incomparable information across different webshops, which calls for an automated method to regain homogeneity in product representations. This paper presents a product duplicate detection solution that exploits a data type-driven property alignment framework. Based on the performed experiment, we show a statistically significant improvement of the F\(_1\)-score from 47.91% to 78.13% compared to an existing state-of-the-art approach.

Notes

  1. The stopwords used can be found at http://www.ranks.nl/stopwords.

  2. https://en.wikipedia.org/wiki/List_of_television_manufacturers.

  3. https://corpuslinguisticmethods.wordpress.com/2014/01/15/what-is-inter-annotator-agreement/.

Author information

Corresponding author

Correspondence to Flavius Frasincar.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

van Rooij, G., Sewnarain, R., Skogholt, M., van der Zaan, T., Frasincar, F., Schouten, K. (2016). A Data Type-Driven Property Alignment Framework for Product Duplicate Detection on the Web. In: Cellary, W., Mokbel, M., Wang, J., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2016. WISE 2016. Lecture Notes in Computer Science, vol 10041. Springer, Cham. https://doi.org/10.1007/978-3-319-48740-3_28

  • DOI: https://doi.org/10.1007/978-3-319-48740-3_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48739-7

  • Online ISBN: 978-3-319-48740-3

  • eBook Packages: Computer Science, Computer Science (R0)
