TEXUS: Table Extraction System for PDF Documents

  • Roya Rastan
  • Hye-Young Paik
  • John Shepherd
  • Seung Hwan Ryu
  • Amin Beheshti
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10837)

Abstract

Tables are a rich and under-exploited source of structured data in otherwise unstructured documents. Extracting and understanding tabular data is a challenging task that has attracted researchers from a range of disciplines, including information retrieval, machine learning and natural language processing. In this demonstration, we present an end-to-end table extraction and understanding system that takes a PDF file and automatically generates a set of XML and CSV files containing the extracted cells, rows and columns of its tables, together with a complete reading order analysis of the tables. Unlike many systems that work as black-box, ad hoc solutions, our system is designed around an open, reusable and extensible architecture that supports research into, and development of, table-processing systems. During the demo, users will see how our system gradually transforms a PDF document into a set of structured files through a series of processing modules, namely locating, segmenting and function/structure analysis.
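The modular flow described above (locating, segmenting, function/structure analysis, then CSV and XML output) can be illustrated with a minimal sketch. This is not the TEXUS implementation: every function name, the sample input, and the heuristics (two or more spaces as a column gap, first row treated as the header) are assumptions made purely for illustration, and the sketch starts from text lines rather than from a real PDF.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for text lines already pulled out of a PDF page.
PAGE_LINES = [
    "Quarterly results are summarised below.",
    "Region    Q1    Q2",
    "North     10    12",
    "South     7     9",
    "Figures are in thousands.",
]

# Assumption: two or more consecutive spaces separate table columns.
COLUMN_GAP = re.compile(r"\s{2,}")


def locate(lines):
    """Locating: collect runs of at least two consecutive tabular-looking lines."""
    tables, current = [], []
    for line in lines:
        if COLUMN_GAP.search(line.strip()):
            current.append(line)
        else:
            if len(current) >= 2:
                tables.append(current)
            current = []
    if len(current) >= 2:
        tables.append(current)
    return tables


def segment(table_lines):
    """Segmenting: split each located line into cells."""
    return [COLUMN_GAP.split(line.strip()) for line in table_lines]


def analyse(rows):
    """Function/structure analysis (toy rule): the first row is the header."""
    return {"header": rows[0], "data": rows[1:]}


def to_csv(table):
    """Serialise one analysed table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(table["header"])
    writer.writerows(table["data"])
    return buf.getvalue()


def to_xml(table):
    """Serialise one analysed table as XML, tagging each row with its function."""
    root = ET.Element("table")
    for kind, rows in (("header", [table["header"]]), ("data", table["data"])):
        for row in rows:
            row_el = ET.SubElement(root, "row", {"function": kind})
            for cell in row:
                ET.SubElement(row_el, "cell").text = cell
    return ET.tostring(root, encoding="unicode")


def run_pipeline(lines):
    """Chain the three modules; each stage consumes the previous stage's output."""
    return [analyse(segment(t)) for t in locate(lines)]


if __name__ == "__main__":
    for table in run_pipeline(PAGE_LINES):
        print(to_csv(table))
        print(to_xml(table))
```

Because each stage is a separate function with a plain data interface, any one of them could be swapped for a different technique without touching the others, which is the kind of reuse an open, extensible architecture is meant to allow.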

Keywords

Table extraction, TEXUS, Table processing, Information extraction, Document processing

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Roya Rastan (1)
  • Hye-Young Paik (1)
  • John Shepherd (1)
  • Seung Hwan Ryu (1)
  • Amin Beheshti (2)

  1. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  2. Department of Computing, Macquarie University, Sydney, Australia
