Survey and empirical comparison of different approaches for text extraction from scholarly figures

Böschen, Falk; Beck, Tilman; Scherp, Ansgar

doi:10.1007/s11042-018-6162-7

Published: 02 June 2018

Volume 77, pages 29475–29505, (2018)

Multimedia Tools and Applications

Abstract

Various approaches have been proposed to address the challenge of extracting text from scholarly figures. However, until recently, no comparative evaluation of these approaches had been conducted. We therefore performed an extensive study of the related work and evaluated a total of 32 different approaches. In this work, we perform a more detailed comparison of the 7 most relevant approaches described in the literature and extend them to 37 systematic linear combinations of methods for extracting text from scholarly figures. Our generic pipeline, consisting of six steps, allows us to freely combine the available methods and to compare them fairly. Overall, we evaluated 44 different linear pipeline configurations and systematically compared the different methods. We then derived two non-linear configurations and a two-pass approach. We evaluate all pipeline configurations on four datasets of scholarly figures of different origin and characteristics. The quality of the extraction results is assessed using F-measure and Levenshtein distance, and we also measure runtime performance. Our experiments showed that one linear configuration achieves the best overall text extraction quality across all datasets. Further experiments showed that this best configuration can be improved by extending it to a two-pass approach. Runtimes differed widely, ranging from very fast approaches to approaches that ran for several weeks. Our experiments identified the best-performing configuration for text extraction within our method set. However, they also showed that further improvements in region extraction and classification are needed.
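The two quality measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The abstract does not say how extracted text is matched against the ground truth, so the exact-match criterion used for precision and recall here is an assumption, and the function names `levenshtein` and `f_measure` are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def f_measure(extracted, gold, beta=1.0):
    """F-measure over two lists of strings.

    Assumption (not from the paper): an extracted string counts as
    correct only if it exactly matches a gold-standard string, and each
    gold string can be matched at most once.
    """
    matched = sum((Counter(extracted) & Counter(gold)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(extracted)
    recall = matched / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, `levenshtein("Revenue", "Rcvenue")` counts one substitution, and `f_measure(["Revenue", "2014"], ["Revenue", "2015"])` has one exact match out of two extracted and two gold strings, so precision and recall are both 0.5. Exact matching penalizes an OCR output with a single wrong character as heavily as a missing string, which is one reason to report an edit distance alongside the F-measure.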

Acknowledgments

This research was co-financed by the EU H2020 project MOVING (http://www.moving-project.eu/) under contract no. 693092. We thank ABBYY Europe GmbH for providing us with a test license of ABBYY FineReader for our experiments.

Author information

Authors and Affiliations

Kiel University, Kiel, Germany
Falk Böschen & Tilman Beck
ZBW - Leibniz Information Centre for Economics, Kiel, Germany
Ansgar Scherp

Corresponding author

Correspondence to Falk Böschen.

About this article

Cite this article

Böschen, F., Beck, T. & Scherp, A. Survey and empirical comparison of different approaches for text extraction from scholarly figures. Multimed Tools Appl 77, 29475–29505 (2018). https://doi.org/10.1007/s11042-018-6162-7

Received: 28 April 2017
Revised: 07 February 2018
Accepted: 16 May 2018
Published: 02 June 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s11042-018-6162-7
