Content Extraction from Marketing Flyers
The rise of online shopping has hurt physical retailers, which struggle to persuade customers to buy products in physical stores rather than online. Marketing flyers are an effective means of increasing the visibility of physical retailers, but the unstructured offers appearing in those documents cannot easily be compared with similar online deals, making it hard for a customer to judge whether it is more advantageous to order a product online or to buy it from the physical shop. In this work we tackle this problem by introducing a content extraction algorithm that automatically extracts structured data from flyers. Unlike competing approaches, which mainly focus on textual content or simply analyze font type, color and text positioning, we propose novel and more advanced visual features that capture the properties of the graphic elements typically used in marketing materials to draw readers' attention to specific deals. The approach obtains excellent results and a high degree of language and genre independence.
Keywords: Content extraction, Portable document format, Visual features, Marketing flyers