When Different Is Wrong: Visual Unsupervised Validation for Web Information Extraction

Potvin, Benoit; Villemaire, Roger

doi:10.1007/978-3-319-96133-0_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10935)

Included in the following conference series:

International Conference on Machine Learning and Data Mining in Pattern Recognition

Abstract

This paper shows how visual information can be used to identify false positives among the entities returned by a state-of-the-art web information extraction algorithm, and hence to further improve extraction results. The proposed validation method is unsupervised and can be integrated into most web information extraction systems effortlessly, without any impact on existing processes, system robustness, or maintenance. Instead of relying on visual patterns, we focus on identifying visual outliers, i.e., entities that visually differ from the norm. In the context of web information extraction, we show that visual outliers tend to be erroneously extracted entities. To validate our method, we post-processed the entities obtained by Boilerpipe, which is known as the best overall main content extraction algorithm for web documents. We show that our validation method improves Boilerpipe’s initial precision by more than 10%, while the \(F_1\) score increases by at least 3% in all relevant cases.
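The idea of treating visually atypical entities as likely false positives can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not detail: the rarity score, the 0.5 threshold, the function names, and the toy data are all assumptions made for illustration. Each extracted entity is represented by a dictionary of style values, entities whose values are rare among their peers are flagged, and precision, recall, and \(F_1\) are computed before and after discarding the flagged entities.

```python
from collections import Counter


def rarity_scores(styles):
    """Score each entity by the mean rarity of its style values (0 = typical)."""
    n = len(styles)
    properties = sorted({prop for style in styles for prop in style})
    if n == 0 or not properties:
        return [0.0] * n
    # For every property, count how many entities share each value.
    counts = {p: Counter(s.get(p) for s in styles) for p in properties}
    scores = []
    for style in styles:
        rarities = [1 - counts[p][style.get(p)] / n for p in properties]
        scores.append(sum(rarities) / len(rarities))
    return scores


def flag_outliers(styles, threshold=0.5):
    """Return indices of entities whose rarity score exceeds the threshold.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    return [i for i, score in enumerate(rarity_scores(styles)) if score > threshold]


def precision_recall_f1(extracted, relevant):
    """Standard precision, recall, and F1 over sets of entity identifiers."""
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration: four body-text blocks and one
# footer-like block that the extractor returned by mistake.
styles = [{"color": "black", "font-size": "16px"}] * 4 + [
    {"color": "gray", "font-size": "11px"}
]
relevant = {0, 1, 2, 3, 5}           # entity 5 was missed by the extractor
extracted = set(range(len(styles)))  # entity 4 is a false positive

flagged = set(flag_outliers(styles))
before = precision_recall_f1(extracted, relevant)
after = precision_recall_f1(extracted - flagged, relevant)
print("flagged:", flagged)
print("before (P, R, F1):", before)
print("after  (P, R, F1):", after)
```

In this toy case, removing the flagged entity raises precision without changing recall, so \(F_1\) rises as well. On real pages an outlier filter can also discard correct entities, which is why the effect on recall has to be measured rather than assumed.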

Notes

1. In the literature, visual web information extraction may refer to the use of a graphical user interface (GUI) that allows the user to generate wrappers. This is not the intended meaning here; we refer instead to the visual formatting of documents.
2. Retained properties are the following: background-color; border-bottom-color; border-bottom-style; border-bottom-width; border-left-color; border-left-style; border-left-width; border-right-color; border-right-style; border-right-width; border-top-color; border-top-left-radius; border-top-right-radius; border-top-style; border-top-width; color; font-size; font-style; font-weight; margin-bottom; margin-left; margin-right; margin-top; outline-color; padding-bottom; padding-left; padding-right; padding-top; position; text-align; text-decoration; visibility.
3. https://boilerpipe-web.appspot.com/.
4. https://cleaneval.sigwac.org.uk/.
5. https://www.nytimes.com/.
6. https://www.theguardian.com/.
7. http://www.ledevoir.com/.
8. https://cleaneval.sigwac.org.uk/annotation_guidelines.html.
9. https://github.com/kohlschutter/boilerpipe.
10. http://phantomjs.org/.
11. Most browser developer tools, such as Firebug for Firefox or Chrome DevTools, give access to the computed style properties of DOM nodes.
12. https://rapidminer.com/.
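As a sketch of how the retained properties of Note 2 might be turned into input for an outlier detector, the snippet below projects a computed-style dictionary onto those 32 properties in a fixed order. The function name, the `missing` placeholder, and the assumption that computed styles arrive as a plain dictionary (for instance, exported from a headless browser) are hypothetical and do not come from the paper.

```python
# The 32 computed style properties listed in Note 2, in the same order.
RETAINED_PROPERTIES = (
    "background-color",
    "border-bottom-color", "border-bottom-style", "border-bottom-width",
    "border-left-color", "border-left-style", "border-left-width",
    "border-right-color", "border-right-style", "border-right-width",
    "border-top-color", "border-top-left-radius", "border-top-right-radius",
    "border-top-style", "border-top-width",
    "color", "font-size", "font-style", "font-weight",
    "margin-bottom", "margin-left", "margin-right", "margin-top",
    "outline-color",
    "padding-bottom", "padding-left", "padding-right", "padding-top",
    "position", "text-align", "text-decoration", "visibility",
)


def to_feature_vector(computed_style, missing="<unset>"):
    """Keep only the retained properties, in a fixed order.

    `computed_style` is assumed to map CSS property names to their computed
    values for one DOM node; properties absent from it get the `missing`
    placeholder so that every entity yields a vector of the same length.
    """
    return [computed_style.get(prop, missing) for prop in RETAINED_PROPERTIES]


# Hypothetical computed style for one extracted text block.
example_style = {"color": "rgb(0, 0, 0)", "font-size": "16px", "width": "300px"}
vector = to_feature_vector(example_style)
print(len(vector), vector[RETAINED_PROPERTIES.index("font-size")])
```

A fixed ordering means the same position always holds the same property across entities, which keeps the vectors comparable; properties outside the retained list, such as `width` here, are simply ignored.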

Acknowledgements

The authors gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Authors and Affiliations

Department of Computer Science, Université du Québec à Montréal, Montréal, H3C 3P8, Canada
Benoit Potvin & Roger Villemaire

Corresponding author

Correspondence to Benoit Potvin.

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Potvin, B., Villemaire, R. (2018). When Different Is Wrong: Visual Unsupervised Validation for Web Information Extraction. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10935. Springer, Cham. https://doi.org/10.1007/978-3-319-96133-0_10

DOI: https://doi.org/10.1007/978-3-319-96133-0_10
Published: 08 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96132-3
Online ISBN: 978-3-319-96133-0
eBook Packages: Computer Science, Computer Science (R0)
