Document Layout Analysis for Semantic Information Extraction

  • Weronika T. Adrian
  • Nicola Leone
  • Marco Manna
  • Cinzia Marte
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10640)

Abstract

Using machines to automatically extract relevant information from unstructured and semi-structured sources has practical significance in today's life and business. In this context, although understanding the meaning of words is important, identifying self-consistent geometric and logical regions of interest (blocks, cells, columns and tables, as well as paragraphs, titles and captions, to mention only a few) is of paramount importance too. This complex process goes under the name of document layout analysis. In this work, we discuss newly designed techniques that solve this problem effectively by combining both syntactic and semantic document aspects. The techniques described here form the basis of KnowRex, a comprehensive system for ontology-driven Information Extraction.

Keywords

Document Layout Analysis, Information Extraction, Table recognition, Answer Set Programming, Ontologies, Knowledge representation

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Weronika T. Adrian (1, 2)
  • Nicola Leone (1)
  • Marco Manna (1)
  • Cinzia Marte (1)

  1. Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
  2. AGH University of Science and Technology, Krakow, Poland
