Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale

  • Robert Meusel
  • Anna Primpeli
  • Christian Meilicke
  • Heiko Paulheim
  • Christian Bizer
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 239)


Semantically annotated data, using markup languages like RDFa and Microdata, has become increasingly available on the Web, especially in the area of e-commerce. As a result, a large number of structured product descriptions are freely available and can be used for applications such as product search or recommendation. However, little effort has been made to analyze the categories of the available product descriptions. Although some products have an explicit category assigned, the categorization schemes vary widely, as the products originate from thousands of different sites. This heterogeneity makes supervised methods, which have been proposed by most previous works, hard to apply. Therefore, in this paper, we explain how distantly supervised approaches can be used to exploit the heterogeneous category information in order to map the products to a set of target categories from an existing product catalogue. Our results show that, even though this task is far from trivial, we can reach almost 56% accuracy for classifying products into 37 categories.


Keywords: Microdata, RDFa, Structured web data, Classification



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Robert Meusel (1)
  • Anna Primpeli (1)
  • Christian Meilicke (1)
  • Heiko Paulheim (1)
  • Christian Bizer (1)

  1. Data and Web Science Group, University of Mannheim, Mannheim, Germany
