When Different Is Wrong: Visual Unsupervised Validation for Web Information Extraction

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10935)

Abstract

This paper shows how visual information can be used to identify false positives among the entities returned by a state-of-the-art web information extraction algorithm, and hence to further improve extraction results. The proposed validation method is unsupervised and can be integrated effortlessly into most web information extraction systems, without any impact on existing processes or on the system’s robustness or maintenance. Instead of relying on visual patterns, we focus on identifying visual outliers, i.e., entities that visually differ from the norm. In the context of web information extraction, we show that visual outliers tend to be erroneously extracted entities. To validate our method, we post-processed the entities obtained by Boilerpipe, which is regarded as the best overall main content extraction algorithm for web documents. We show that our validation method improves Boilerpipe’s initial precision by more than 10%, while the \(F_1\) score increases by at least 3% in all relevant cases.
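The abstract does not name the outlier detection algorithm, so the following Python sketch only illustrates the general idea of treating visually unusual entities as probable false positives. It is not the authors' implementation. The frequency-based rarity score, the function names `visual_outlier_scores` and `filter_outliers`, the threshold of 0.5, and the sample data are all assumptions made for illustration; the style properties used are a small subset of those listed in the Notes.

```python
from collections import Counter


def visual_outlier_scores(entities):
    """Score each entity by how rare its computed-style values are.

    `entities` is a list of dicts mapping a CSS property name to its
    computed value. The score is the mean, over all properties, of
    1 - (share of entities having the same value). This scoring rule is
    an illustrative assumption, not the method used in the paper.
    """
    if not entities:
        return []
    properties = sorted({prop for entity in entities for prop in entity})
    n = len(entities)
    # For each property, count how often each value occurs across entities.
    counts = {p: Counter(e.get(p) for e in entities) for p in properties}
    scores = []
    for entity in entities:
        rarity = [1.0 - counts[p][entity.get(p)] / n for p in properties]
        scores.append(sum(rarity) / len(rarity) if rarity else 0.0)
    return scores


def filter_outliers(entities, threshold=0.5):
    """Split entities into (kept, rejected) using a hypothetical threshold."""
    scores = visual_outlier_scores(entities)
    kept = [e for e, s in zip(entities, scores) if s <= threshold]
    rejected = [e for e, s in zip(entities, scores) if s > threshold]
    return kept, rejected


# Hypothetical extractor output: four body paragraphs and one footer line.
body = {"font-size": "16px", "color": "rgb(0, 0, 0)", "font-weight": "400"}
footer = {"font-size": "11px", "color": "rgb(128, 128, 128)", "font-weight": "400"}
extracted = [dict(body) for _ in range(4)] + [footer]

kept, rejected = filter_outliers(extracted)
print(len(kept), len(rejected))
```

In this toy input the footer line differs from the majority on two of three properties, so it is the only entity rejected. Because the score depends only on the extracted entities themselves, no labeled data is needed, which matches the unsupervised, post-processing setting described in the abstract.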

Notes

  1. In the literature, visual web information extraction may refer to the use of a graphical user interface (GUI) that allows the user to generate wrappers. This is not the intended meaning here, as we refer to the visual formatting of documents.

  2. Retained properties are the following: background-color; border-bottom-color; border-bottom-style; border-bottom-width; border-left-color; border-left-style; border-left-width; border-right-color; border-right-style; border-right-width; border-top-color; border-top-left-radius; border-top-right-radius; border-top-style; border-top-width; color; font-size; font-style; font-weight; margin-bottom; margin-left; margin-right; margin-top; outline-color; padding-bottom; padding-left; padding-right; padding-top; position; text-align; text-decoration; visibility.

  3. https://boilerpipe-web.appspot.com/.

  4. https://cleaneval.sigwac.org.uk/.

  5. https://www.nytimes.com/.

  6. https://www.theguardian.com/.

  7. http://www.ledevoir.com/.

  8. https://cleaneval.sigwac.org.uk/annotation_guidelines.html.

  9. https://github.com/kohlschutter/boilerpipe.

  10. http://phantomjs.org/.

  11. Most developer tools included in browsers, such as Firebug for Firefox or Chrome DevTools, give access to the computed style properties of DOM nodes.

  12. https://rapidminer.com/.

Acknowledgements

The authors gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Corresponding author

Correspondence to Benoit Potvin.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Potvin, B., Villemaire, R. (2018). When Different Is Wrong: Visual Unsupervised Validation for Web Information Extraction. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10935. Springer, Cham. https://doi.org/10.1007/978-3-319-96133-0_10

  • DOI: https://doi.org/10.1007/978-3-319-96133-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96132-3

  • Online ISBN: 978-3-319-96133-0

  • eBook Packages: Computer Science, Computer Science (R0)
