TabbyPDF: Web-Based System for PDF Table Extraction

Shigarov, Alexey; Altaev, Andrey; Mikhailov, Andrey; Paramonov, Viacheslav; Cherkashin, Evgeniy

doi:10.1007/978-3-319-99972-2_20

Conference paper
First Online: 29 August 2018

Part of the book series: Communications in Computer and Information Science (CCIS, volume 920)

Abstract

PDF is one of the most widespread formats for representing non-editable documents. Many PDF documents are machine-readable but untagged: they carry no tags identifying layout items such as paragraphs, columns, or tables. An important challenge with such documents is extracting tabular data from them. This paper presents a novel web-based system that extracts tables from untagged PDF documents with complex layouts, recovers their cell structures, and exports them in a tagged form (e.g. CSV or HTML). The system takes a heuristic-based approach to table detection and structure recognition, relying mainly on recovering the human reading order of text, including document paragraphs and table cells. A prototype of the system was evaluated using the methodology and dataset of the “ICDAR 2013 Table Competition”. It achieves an F-score of 93.64% for the structure recognition phase and 83.18% for table extraction with automatic table detection. These results are comparable with state-of-the-art academic solutions.
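
For readers unfamiliar with the metric, the F-score reported above is the harmonic mean of precision and recall. The sketch below is only an illustration in Python, not the competition's evaluation tool: it assumes, in the spirit of the ICDAR 2013 Table Competition methodology, that a recognized table is compared with the ground truth as a set of neighbouring-cell relations. The function name `f_score`, the tuple layout, and the sample cells are all hypothetical.

```python
def f_score(correct: int, detected: int, expected: int) -> float:
    """Return the harmonic mean of precision and recall (F1).

    correct  -- relations found in both the output and the ground truth
    detected -- relations produced by the system
    expected -- relations present in the ground truth
    """
    if correct == 0 or detected == 0 or expected == 0:
        return 0.0
    precision = correct / detected
    recall = correct / expected
    return 2 * precision * recall / (precision + recall)


# Hypothetical adjacency relations: (cell text, neighbouring cell text, direction).
ground_truth = {
    ("Year", "Sales", "right"),
    ("Year", "2017", "down"),
    ("Sales", "10", "down"),
    ("2017", "10", "right"),
}
# Illustrative system output with one misread cell ("1O" instead of "10").
extracted = {
    ("Year", "Sales", "right"),
    ("Year", "2017", "down"),
    ("Sales", "10", "down"),
    ("2017", "1O", "right"),
}

correct = len(ground_truth & extracted)
score = f_score(correct, len(extracted), len(ground_truth))
print(f"precision = recall = {correct / len(extracted):.2f}, F-score = {score:.2f}")
```

Because a single misread cell removes every relation that involves it, a measure of this kind penalizes structural and textual errors together, which is one reason structure recognition and end-to-end extraction are reported as separate figures.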

Notes

1. https://www.pdfa.org/pdf-in-2016-broader-deeper-richer
2. http://pdftohtml.sourceforge.net
3. http://tabula.technology
4. https://semantic-ui.com
5. https://mozilla.github.io/pdf.js
6. https://spring.io
7. http://tomcat.apache.org
8. https://db.apache.org/derby
9. TabbyPDF core: https://github.com/cellsrg/tabbypdf
   TabbyPDF client: https://github.com/cellsrg/tabbypdf-front
   TabbyPDF server: https://github.com/cellsrg/tabbypdf-web
10. http://cells.icc.ru/pdfte
11. http://www.tamirhassan.com/dataset.html
12. http://tamirhassan.com/competition/dataset-tools.html
13. https://sourceforge.net/projects/itext

Acknowledgments

This work is supported by the Russian Foundation for Basic Research (grants 18-07-00758 and 17-47-380007). The prototype of TabbyPDF is deployed on resources of the Shared Equipment Center of Integrated Information and Computing Network for Irkutsk Research and Educational Complex (http://net.icc.ru).

Author information

Authors and Affiliations

Matrosov Institute for System Dynamics and Control Theory of SB RAS, 134 Lermontov st., Irkutsk, Russia
Alexey Shigarov, Andrey Altaev, Andrey Mikhailov, Viacheslav Paramonov & Evgeniy Cherkashin
Institute of Mathematics, Economics and Informatics, Irkutsk State University, 20 Gagarin blvd., Irkutsk, Russia
Alexey Shigarov, Viacheslav Paramonov & Evgeniy Cherkashin

Corresponding author

Correspondence to Alexey Shigarov.

Editor information

Editors and Affiliations

Kaunas University of Technology, Kaunas, Lithuania
Robertas Damaševičius
Kaunas University of Technology, Kaunas, Lithuania
Giedrė Vasiljevienė

About this paper

Cite this paper

Shigarov, A., Altaev, A., Mikhailov, A., Paramonov, V., Cherkashin, E. (2018). TabbyPDF: Web-Based System for PDF Table Extraction. In: Damaševičius, R., Vasiljevienė, G. (eds) Information and Software Technologies. ICIST 2018. Communications in Computer and Information Science, vol 920. Springer, Cham. https://doi.org/10.1007/978-3-319-99972-2_20

DOI: https://doi.org/10.1007/978-3-319-99972-2_20
Published: 29 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99971-5
Online ISBN: 978-3-319-99972-2
eBook Packages: Computer Science, Computer Science (R0)
