The WDC Gold Standards for Product Feature Extraction and Product Matching

  • Conference paper
E-Commerce and Web Technologies (EC-Web 2016)

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 278))

Abstract

Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity found on the Web; (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions. To overcome these shortcomings, we have created two public gold standards. The WDC Product Feature Extraction Gold Standard consists of over 500 product web pages originating from 32 different websites, on which we have annotated all product attributes (338 distinct attributes) that appear in product titles, product descriptions, tables, and lists. The WDC Product Matching Gold Standard consists of over 75,000 correspondences between 150 products (mobile phones, TVs, and headphones) in a central catalog and offers for these products on the 32 websites. To verify that the gold standards are challenging enough, we ran several baseline feature extraction and matching methods, which yielded F-score values in the range 0.39 to 0.67. In addition to the gold standards, we also provide a corpus of 13 million product pages from the same websites, which may be useful as background knowledge for training feature extraction and matching methods.
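
To make the idea of "comparing features using a matching function" and the reported F-score concrete, the following is a minimal sketch. It is not one of the baseline methods evaluated in the paper. It assumes, purely for illustration, that each offer and each catalog product has already been reduced to a dictionary of attribute names and values, that similarity is measured as Jaccard overlap of normalized attribute-value pairs, and that a match is declared above a hypothetical threshold of 0.5. All function names and the example data are invented for this sketch.

```python
from typing import Dict, Set, Tuple

Features = Dict[str, str]          # attribute name -> attribute value
Pair = Tuple[str, str]             # (offer id, product id) correspondence


def normalize(features: Features) -> Set[Tuple[str, str]]:
    """Lower-case and trim names and values so trivial differences are ignored."""
    return {(k.strip().lower(), v.strip().lower()) for k, v in features.items()}


def jaccard(a: Features, b: Features) -> float:
    """Jaccard overlap of the normalized attribute-value pairs of two records."""
    sa, sb = normalize(a), normalize(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def is_match(offer: Features, product: Features, threshold: float = 0.5) -> bool:
    """Declare a match when the similarity reaches the (assumed) threshold."""
    return jaccard(offer, product) >= threshold


def f1_score(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Harmonic mean of precision and recall over sets of correspondences."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example records: two of four distinct attribute-value pairs agree.
product = {"brand": "Samsung", "model": "Galaxy S4", "storage": "16 GB"}
offer = {"Brand": "samsung ", "Model": "Galaxy S4", "Color": "black"}

# Invented correspondences: two of three predictions appear in the gold set.
predicted_pairs = {("o1", "p1"), ("o2", "p2"), ("o3", "p1")}
gold_pairs = {("o1", "p1"), ("o2", "p2"), ("o4", "p3")}
```

With these invented records, `jaccard(offer, product)` evaluates to 0.5 and `f1_score(predicted_pairs, gold_pairs)` to 2/3. A set-overlap measure like this depends heavily on how well attribute names and values have been extracted and normalized, which is the dependency between feature extraction and matching that the abstract describes.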

Notes

  1. Retail e-commerce sales worldwide from 2014 to 2019 - http://www.statista.com/statistics/379046/worldwide-retail-e-commerce-sales/.

  2. http://webdatacommons.org.

  3. http://webdatacommons.org/productcorpus/.

  4. http://www.alexa.com/.

  5. https://www.w3.org/TR/microdata/.

  6. http://any23.apache.org/.

  7. https://github.com/scrapy/scrapy.

  8. http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.

  9. http://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution.

  10. http://webdatacommons.org/structureddata/index.html.

References

  1. Gopalakrishnan, V., Iyengar, S.P., Madaan, A., Rastogi, R., Sengamedu, S.: Matching product titles using web-based enrichment. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 605–614. ACM, New York (2012)

  2. Kannan, A., Givoni, I.E., Agrawal, R., Fuxman, A.: Matching unstructured product offers to structured product specifications. In: 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 404–412 (2011)

  3. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endowment 3(1–2), 484–493 (2010)

  4. Köpcke, H., Thor, A., Thomas, S., Rahm, E.: Tailoring entity resolution for matching product offers. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 545–550. ACM (2012)

  5. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053 (2014)

  6. McAuley, J., Targett, C., Shi, Q., van den Hengel, A.: Image-based recommendations on styles and substitutes. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. ACM (2015)

  7. Melli, G.: Shallow semantic parsing of product offering titles (for better automatic hyperlink insertion). In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pp. 1670–1678. ACM, New York (2014)

  8. Meusel, R., Petrovski, P., Bizer, C.: The webdatacommons microdata, RDFa and microformat dataset series. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 277–292. Springer, Heidelberg (2014). doi:10.1007/978-3-319-11964-9_18

  9. Meusel, R., Primpeli, A., Meilicke, C., Paulheim, H., Bizer, C.: Exploiting microdata annotations to consistently categorize product offers at web scale. In: Stuckenschmidt, H., Jannach, D. (eds.) EC-Web 2015. LNBIP, vol. 239, pp. 83–99. Springer, Heidelberg (2015). doi:10.1007/978-3-319-27729-5_7

  10. Nguyen, H., Fuxman, A., Paparizos, S., Freire, J., Agrawal, R.: Synthesizing products for online catalogs. Proc. VLDB Endowment 4(7), 409–418 (2011)

  11. Petrovski, P., Bryl, V., Bizer, C.: Integrating product data from websites offering microdata markup. In: Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pp. 1299–1304. International World Wide Web Conferences Steering Committee (2014)

  12. Petrovski, P., Bryl, V., Bizer, C.: Learning regular expressions for the extraction of product attributes from e-commerce microdata (2014)

  13. Qiu, D., Barbosa, L., Dong, X.L., Shen, Y., Srivastava, D.: Dexter: large-scale discovery and extraction of product specifications on the web. Proc. VLDB Endowment 8(13), 2194–2205 (2015)

  14. Ristoski, P., Mika, P.: Enriching product ads with metadata from HTML annotations. In: Proceedings of the 13th Extended Semantic Web Conference (2015, to appear)

Author information

Corresponding author

Correspondence to Petar Petrovski.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Petrovski, P., Primpeli, A., Meusel, R., Bizer, C. (2017). The WDC Gold Standards for Product Feature Extraction and Product Matching. In: Bridge, D., Stuckenschmidt, H. (eds) E-Commerce and Web Technologies. EC-Web 2016. Lecture Notes in Business Information Processing, vol 278. Springer, Cham. https://doi.org/10.1007/978-3-319-53676-7_6

  • DOI: https://doi.org/10.1007/978-3-319-53676-7_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-53675-0

  • Online ISBN: 978-3-319-53676-7

  • eBook Packages: Computer Science, Computer Science (R0)
