TOC Structure Extraction from OCR-ed Books

Liu, Caihua; Chen, Jiajun; Zhang, Xiaofeng; Liu, Jie; Huang, Yalou

doi:10.1007/978-3-642-35734-3_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7424)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

This paper addresses the task of extracting the table of contents (TOC) from OCR-ed books. Because the OCR process loses much of a book's layout and structural information, the OCR output alone does not support navigation. A TOC provides a convenient and quick way to locate content of interest. We propose a hybrid method for TOC extraction that combines a rule-based method and an SVM-based method. The rule-based method focuses on recovering the TOC from books that have TOC pages, while the SVM-based method handles books without TOC pages. Experimental results indicate that the proposed methods achieve performance comparable to that of the other participants in the ICDAR 2011 Book Structure Extraction competition.
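
The hybrid dispatch described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the regular expression `TOC_LINE`, the 0.5 ratio threshold, and the function names are hypothetical, and `is_heading` is only a stand-in for the trained SVM classifier, whose features and training setup the abstract does not describe.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical rule: a TOC entry is a title, then dot leaders or
# whitespace (at least two characters), then a page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]{2,}(?P<page>\d+)$")

Entry = Tuple[str, int]  # (title, page number)


def parse_toc_page(lines: List[str]) -> List[Entry]:
    """Rule-based step: collect every line that matches the TOC pattern."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries


def looks_like_toc_page(lines: List[str], min_ratio: float = 0.5) -> bool:
    """Treat a page as a TOC page when enough non-empty lines match the pattern.

    The 0.5 threshold is an arbitrary illustrative value.
    """
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return False
    hits = sum(1 for line in non_empty if TOC_LINE.match(line.strip()))
    return hits / len(non_empty) >= min_ratio


def extract_toc(pages: List[List[str]], is_heading: Callable[[str], bool]) -> List[Entry]:
    """Hybrid dispatch: use rules when TOC pages exist, else classify lines.

    `is_heading` is a placeholder for an SVM-based heading classifier.
    """
    entries: List[Entry] = []
    for lines in pages:
        if looks_like_toc_page(lines):
            entries.extend(parse_toc_page(lines))
    if entries:
        return entries
    # No TOC pages found: fall back to per-line heading classification.
    return [
        (line.strip(), page_no)
        for page_no, lines in enumerate(pages, start=1)
        for line in lines
        if line.strip() and is_heading(line)
    ]


book_with_toc = [
    ["Contents", "Introduction ..... 1", "Methods ..... 15", "Results ..... 42"],
    ["Introduction", "Some body text here."],
]
book_without_toc = [
    ["CHAPTER ONE", "It was a dark night."],
    ["More text.", "CHAPTER TWO"],
]
toc_from_rules = extract_toc(book_with_toc, is_heading=str.isupper)
toc_from_classifier = extract_toc(book_without_toc, is_heading=str.isupper)
```

Passing the classifier in as a callable keeps the rule-based branch testable on its own; in practice `str.isupper` would be replaced by a function that builds features for a line and queries a trained model.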

Author information

Authors and Affiliations

College of Information Technical Science, Nankai University, Tianjin, China, 300071
Caihua Liu & Jie Liu
College of Software, Nankai University, Tianjin, China, 300071
Jiajun Chen, Xiaofeng Zhang & Yalou Huang

Editor information

Editors and Affiliations

Faculty of Science and Technology, Queensland University of Technology (QUT), PO Box 2434, 4001, Brisbane, QLD, Australia
Shlomo Geva
Archives and Information Studies/Humanities, University of Amsterdam, Turfdraagsterpad 9, 1012XT, Amsterdam, The Netherlands
Jaap Kamps
Cluster of Excellence Multimodal Computing and Interaction, Saarland University, Campus E1, 66123, Saarbrücken, Germany
Ralf Schenkel

About this paper

Cite this paper

Liu, C., Chen, J., Zhang, X., Liu, J., Huang, Y. (2012). TOC Structure Extraction from OCR-ed Books. In: Geva, S., Kamps, J., Schenkel, R. (eds) Focused Retrieval of Content and Structure. INEX 2011. Lecture Notes in Computer Science, vol 7424. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35734-3_8

DOI: https://doi.org/10.1007/978-3-642-35734-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35733-6
Online ISBN: 978-3-642-35734-3
eBook Packages: Computer Science, Computer Science (R0)
