A Comparison of Approaches for Automated Text Extraction from Scholarly Figures

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10132)

Abstract

So far, there has been no comparative evaluation of different approaches for text extraction from scholarly figures. To fill this gap, we have defined a generic pipeline for text extraction that abstracts from the existing approaches documented in the literature. In this paper, we use this generic pipeline to systematically evaluate and compare 32 configurations for text extraction over four datasets of scholarly figures of different origin and characteristics. In total, our experiments were run over more than 400 manually labeled figures. The experimental results show that the approach BS-4OS achieves the best F-measure of 0.67 for Text Location Detection and the best average Levenshtein distance of 4.71 between the recognized text and the gold standard on all four datasets using the Ocropy OCR engine.
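The two metrics named in the abstract can be illustrated with a minimal sketch of their standard definitions. The function names `levenshtein` and `f_measure` are illustrative and do not come from the paper, and the abstract does not describe how detected text regions are matched to the gold standard, so this is not the authors' evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only; these are not values from the paper.
distance = levenshtein("kitten", "sitting")  # 3
score = f_measure(0.5, 1.0)
```

A lower Levenshtein distance means the recognized text is closer to the gold standard, while a higher F-measure means better detection of text locations.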

Notes

  1. http://www.kd.informatik.uni-kiel.de/en/research/software/text-extraction

  2. https://github.com/tesseract-ocr/

  3. http://www.abbyy.com/ocr-sdk/

  4. https://github.com/tmbdev/ocropy

  5. https://www.econbiz.de/

  6. http://www.degruyter.com/

  7. http://www.degruyter.com/dg/page/open-access-policy

  8. https://www.comp.nus.edu.sg/~tancl/ChartImageDataset.htm

Acknowledgement

This research was co-financed by the EU H2020 project MOVING (http://www.moving-project.eu/) under contract no. 693092.

Author information

Corresponding authors

Correspondence to Falk Böschen or Ansgar Scherp.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Böschen, F., Scherp, A. (2017). A Comparison of Approaches for Automated Text Extraction from Scholarly Figures. In: Amsaleg, L., Guðmundsson, G., Gurrin, C., Jónsson, B., Satoh, S. (eds) MultiMedia Modeling. MMM 2017. Lecture Notes in Computer Science, vol 10132. Springer, Cham. https://doi.org/10.1007/978-3-319-51811-4_2

  • DOI: https://doi.org/10.1007/978-3-319-51811-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-51810-7

  • Online ISBN: 978-3-319-51811-4

  • eBook Packages: Computer Science, Computer Science (R0)
