An Efficient Pre-processing Method to Identify Logical Components from PDF Documents

Liu, Ying; Bai, Kun; Gao, Liangcai

doi:10.1007/978-3-642-20841-6_41

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6634)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

With the rapid growth of scientific documents in digital libraries, the demand for searching documents, as well as specific components within them, has increased dramatically. Accurately detecting component boundaries is vital to all subsequent information extraction and applications. However, detecting document component boundaries (especially for tables, figures, and equations) is challenging because formats and layouts are not standardized across diverse documents.

This paper presents an efficient document preprocessing technique that improves component boundary detection by taking advantage of the nature of document lines. Our method reduces the component boundary detection problem to a sparse line analysis problem with much less noise. We define eight document line label types and apply machine learning techniques as well as a heuristic rule-based method to identify multiple document components. By combining different heuristic rules, we extract multiple components in a batch, filtering out most of the noise as early as possible. Our method focuses on an important untagged document format: PDF. The experimental results demonstrate the effectiveness of sparse line analysis.
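The abstract does not define what makes a line "sparse", so the following is only a minimal sketch of the filtering idea under stated assumptions. It assumes text lines have already been extracted from the PDF as words with horizontal coordinates, and it treats a line as sparse when its words cover little of the column width or when two adjacent words are separated by a wide gap. The types `Word` and `Line`, the functions `is_sparse` and `sparse_line_indices`, and the thresholds `min_fill` and `max_gap` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word, in page units
    x1: float  # right edge of the word, in page units


@dataclass
class Line:
    words: List[Word]


def is_sparse(line: Line, column_width: float,
              min_fill: float = 0.5, max_gap: float = 20.0) -> bool:
    """Hypothetical rule (an assumption, not the paper's definition):
    a line is sparse if its words cover less than `min_fill` of the
    column width, or if any gap between adjacent words exceeds `max_gap`."""
    if not line.words:
        return False
    words = sorted(line.words, key=lambda w: w.x0)
    covered = sum(w.x1 - w.x0 for w in words)
    gaps = [b.x0 - a.x1 for a, b in zip(words, words[1:])]
    widest_gap = max(gaps, default=0.0)
    return covered / column_width < min_fill or widest_gap > max_gap


def sparse_line_indices(lines: List[Line], column_width: float) -> List[int]:
    """Return the indices of candidate lines; dense body text is dropped early."""
    return [i for i, line in enumerate(lines) if is_sparse(line, column_width)]
```

Under these assumptions, ordinary paragraph lines are discarded before any classifier runs, and only the remaining candidates (table rows, captions, equations, headings) would go on to the line labeling step.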

Author information

Authors and Affiliations

Department of Knowledge Service Engineering, KAIST, South Korea
Ying Liu
IBM Research T.J. Watson Research Center, USA
Kun Bai
Institute of Computer Science and Technology, Peking University, China
Liangcai Gao

Editor information

Editors and Affiliations

Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, 518055, Shenzhen, China
Joshua Zhexue Huang
Faculty of Engineering and Information Technology, Center for Quantum Computation and Intelligent Systems, Data Sciences and Knowledge Discovery Lab, University of Technology Sydney, NSW 2007, Sydney, Australia
Longbing Cao
Department of Computer Science and Engineering, University of Minnesota, MN 55455, Minneapolis, USA
Jaideep Srivastava

About this paper

Cite this paper

Liu, Y., Bai, K., Gao, L. (2011). An Efficient Pre-processing Method to Identify Logical Components from PDF Documents. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_41

DOI: https://doi.org/10.1007/978-3-642-20841-6_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6
eBook Packages: Computer Science, Computer Science (R0)
