TEXUS: Table Extraction System for PDF Documents

Rastan, Roya; Paik, Hye-Young; Shepherd, John; Ryu, Seung Hwan; Beheshti, Amin

doi:10.1007/978-3-319-92013-9_30

Conference paper
First Online: 18 May 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10837)

Abstract

Tables are a rich and under-exploited source of structured data in otherwise unstructured documents. Extracting and understanding tabular data is a challenging task that has attracted researchers from a range of disciplines, including information retrieval, machine learning and natural language processing. In this demonstration, we present an end-to-end table extraction and understanding system that takes a PDF file and automatically generates a set of XML and CSV files containing the extracted cells, rows and columns of its tables, together with a complete reading-order analysis of the tables. Unlike many systems that operate as black-box, ad hoc solutions, our system is designed around an open, reusable and extensible architecture to support research into, and development of, table-processing systems. During the demo, users will see how our system progressively transforms a PDF document into a set of structured files through a series of processing modules, namely: locating, segmenting and function/structure analysis.
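
As a rough illustration of the modular design described in the abstract, the following Python sketch chains three stages named after the locating, segmenting and function/structure analysis modules and serialises the result as CSV. Everything in it is an assumption made for illustration: the `Document` class, the stage functions, the whitespace heuristics and the plain-text input are hypothetical stand-ins, not the TEXUS implementation or its API. In particular, the sketch does not parse PDF and does not produce XML.

```python
import csv
import io
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical intermediate representation handed from stage to stage."""
    lines: List[str]                                        # plain text standing in for PDF content
    table_lines: List[str] = field(default_factory=list)    # filled by locate()
    cells: List[List[str]] = field(default_factory=list)    # filled by segment()
    header: List[str] = field(default_factory=list)         # filled by analyse()
    data: List[List[str]] = field(default_factory=list)     # filled by analyse()


# Every stage has the same signature, so stages can be swapped or extended.
Stage = Callable[[Document], Document]


def split_fields(line: str) -> List[str]:
    """Split a line wherever two or more whitespace characters occur."""
    return re.split(r"\s{2,}", line.strip())


def locate(doc: Document) -> Document:
    """Toy locating step: keep lines that split into at least two fields."""
    doc.table_lines = [ln for ln in doc.lines if len(split_fields(ln)) >= 2]
    return doc


def segment(doc: Document) -> Document:
    """Toy segmenting step: turn each located line into a row of cells."""
    doc.cells = [split_fields(ln) for ln in doc.table_lines]
    return doc


def analyse(doc: Document) -> Document:
    """Toy analysis step: assume the first row is the header, the rest is data."""
    if doc.cells:
        doc.header, doc.data = doc.cells[0], doc.cells[1:]
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply the stages in order, passing the document along."""
    for stage in stages:
        doc = stage(doc)
    return doc


def to_csv(doc: Document) -> str:
    """Serialise the header and data rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(doc.header)
    writer.writerows(doc.data)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = Document(lines=[
        "Quarterly results",
        "Region  Q1  Q2",
        "North  10  12",
        "South  7  9",
        "Figures are in thousands.",
    ])
    result = run_pipeline(sample, [locate, segment, analyse])
    print(to_csv(result))
```

Because each stage shares one signature, a different locating or segmenting heuristic could be substituted by passing a different function in the list given to `run_pipeline`, which is the kind of extensibility the abstract argues for.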

Notes

1. It is generally accepted that tables can be understood if one can detect the hierarchical structure of table headers properly and determine how each table data cell can be uniquely accessed through them.
2. http://www.foolabs.com/xpdf.
3. From the PDF specification (http://www.adobe.com/devnet/pdf/pdf_reference.html).
4. http://www.tamirhassan.com/competition.html.

Author information

Authors and Affiliations

School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Roya Rastan, Hye-Young Paik, John Shepherd & Seung Hwan Ryu
Department of Computing, Macquarie University, Sydney, Australia
Amin Beheshti

Corresponding author

Correspondence to Seung Hwan Ryu.

Editor information

Editors and Affiliations

ICT, Griffith University, Southport, Queensland, Australia
Junhu Wang
Nanyang Technological University, Singapore, Singapore
Gao Cong
Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Victoria, Australia
Jinjun Chen
The University of Melbourne, Melbourne, Victoria, Australia
Jianzhong Qi

Cite this paper

Rastan, R., Paik, HY., Shepherd, J., Ryu, S.H., Beheshti, A. (2018). TEXUS: Table Extraction System for PDF Documents. In: Wang, J., Cong, G., Chen, J., Qi, J. (eds) Databases Theory and Applications. ADC 2018. Lecture Notes in Computer Science, vol 10837. Springer, Cham. https://doi.org/10.1007/978-3-319-92013-9_30

DOI: https://doi.org/10.1007/978-3-319-92013-9_30
Published: 18 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92012-2
Online ISBN: 978-3-319-92013-9
eBook Packages: Computer Science, Computer Science (R0)
