Abstract
Various approaches have been proposed to address the challenge of extracting text from scholarly figures. However, until recently, no comparative evaluation of these approaches had been conducted. We therefore performed an extensive study of the related work and assessed a total of 32 different approaches. In this work, we perform a more detailed comparison of the 7 most relevant approaches described in the literature and extend them to 37 systematic linear combinations of methods for extracting text from scholarly figures. Our generic pipeline, consisting of six steps, allows us to freely combine the available methods and to compare them fairly. Overall, we evaluated 44 different linear pipeline configurations and systematically compared the different methods. We then derived two non-linear configurations and a two-pass approach. We evaluated all pipeline configurations on four datasets of scholarly figures of different origin and characteristics. The quality of the extraction results is assessed using F-measure and Levenshtein distance, and we also measure runtime performance. Our experiments showed that one linear configuration achieves the best overall text extraction quality across all datasets. Further experiments showed that this configuration can be improved by extending it to a two-pass approach. Regarding runtime, we observed very large differences, ranging from very fast approaches to those running for several weeks. Our experiments identified the best-performing configuration for text extraction within our method set. However, they also showed that further improvements in region extraction and classification are needed.
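The two quality measures named above can be illustrated with a minimal Python sketch. It shows only the textbook definitions of Levenshtein distance and F-measure. The function names and the example counts are hypothetical, and how extracted text is matched against ground truth in the actual evaluation is not described in this abstract, so that step is not modeled here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    # Keep the shorter string in the inner loop to limit the row size.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example: OCR output versus a ground-truth axis label.
    print(levenshtein("Tirne (s)", "Time (s)"))
    # Hypothetical counts of matched, spurious, and missed text elements.
    print(f_measure(tp=8, fp=2, fn=2))
```

A lower Levenshtein distance means the recognized string is closer to the ground truth, while a higher F-measure means a better balance between spurious and missed detections.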
Notes
http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction, last access: September, 2017
http://www.degruyter.com/, last access: September, 2017
http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction, last access: September, 2017
https://github.com/tesseract-ocr/, last access: September, 2017
http://www.abbyy.com/ocr-sdk/, last access: September, 2017
http://www-e.uni-magdeburg.de/jschulen/ocr/index.html, last access: September, 2017
https://github.com/tmbdev/ocropy, last access: September, 2017
https://www.abbyy.com/en-us/ocr-sdk/, last access: September, 2017
https://www.econbiz.de/, last access: September, 2017
http://www.degruyter.com/, last access: September, 2017
http://www.degruyter.com/dg/page/open-access-policy, last access: September, 2017
https://www.comp.nus.edu.sg/tancl/ChartImageDataset.htm, last access: September, 2017
Acknowledgments
This research was co-financed by the EU H2020 project MOVING (http://www.moving-project.eu/) under contract no. 693092. We thank ABBYY Europe GmbH for providing us with a test license of ABBYY FineReader for our experiments.
Cite this article
Böschen, F., Beck, T. & Scherp, A. Survey and empirical comparison of different approaches for text extraction from scholarly figures. Multimed Tools Appl 77, 29475–29505 (2018). https://doi.org/10.1007/s11042-018-6162-7
DOI: https://doi.org/10.1007/s11042-018-6162-7