Extraction of Referential Heading-Entries in Recognized Table of Contents Pages

  • Phuc Tri Nguyen
  • Dang Tuan Nguyen
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 348)

Abstract

This paper presents our research on extracting referential heading-entries from recognized table of contents (TOC) pages. The task faces two issues: complex layouts (e.g., a referential heading-entry can span one or many lines and may include decorative text) and text errors introduced by OCR processing in the training data. Our approach uses several layout-based and content-based features to classify the textual lines of TOC pages in the datasets. We also propose synthesis rules that combine related classified lines to identify referential heading-entries. The experiments are conducted on the ICDAR Book Structure Extraction datasets of 2009, 2011, and 2013. The experimental results show that the proposed approach is more efficient than previous methods of referential heading-entry extraction.
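The two-stage idea in the abstract (classify each TOC line, then combine related lines into entries) can be illustrated with a minimal Python sketch. The abstract does not specify the actual features, labels, or synthesis rules, so everything here is an assumption made for illustration: the labels `OTHER`, `PART`, and `END`, the `DECORATIVE` word list, and the single content-based cue (a trailing page number) are hypothetical, and a hand-written rule stands in for a trained classifier.

```python
import re

# Hypothetical list of "decorative" lines that belong to no entry.
DECORATIVE = {"contents", "table of contents", "page"}
# Content-based cue (assumed): a line ending in digits closes an entry.
PAGE_AT_END = re.compile(r"\b(\d+)\s*$")
# Strips an optional dot leader plus the trailing page number.
LEADER_AND_PAGE = re.compile(r"[\s.]*\b\d+\s*$")


def classify(text):
    """Toy stand-in for a trained line classifier (labels are hypothetical)."""
    stripped = text.strip()
    if not stripped or stripped.lower() in DECORATIVE:
        return "OTHER"
    if PAGE_AT_END.search(stripped):
        return "END"   # last (or only) line of an entry
    return "PART"      # earlier line of a multi-line entry


def extract_entries(lines):
    """Illustrative synthesis rule: join PART lines up to the next END line."""
    entries, pending = [], []
    for text in lines:
        stripped = text.strip()
        label = classify(stripped)
        if label == "OTHER":
            pending = []                      # decorative text breaks an entry
        elif label == "PART":
            pending.append(stripped)
        else:
            page = int(PAGE_AT_END.search(stripped).group(1))
            pending.append(LEADER_AND_PAGE.sub("", stripped))
            title = " ".join(part for part in pending if part)
            entries.append((title, page))
            pending = []
    return entries


sample = [
    "CONTENTS",
    "Chapter I. The Early History of the",
    "Settlement . . . . . 12",
    "Chapter II. Growth 45",
]
entries = extract_entries(sample)
```

On the sample lines, the two-line entry is merged into a single heading with page 12, and the single-line entry is kept with page 45. A real system would replace `classify` with a model trained on layout-based and content-based features and would need to tolerate OCR errors in the page numbers.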

Keywords

table of content recognition; document structure extraction; referential heading-entries extraction

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Faculty of Computer Science, University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
