A Study on the Classification of Layout Components for Newspapers

  • Stefano Ferilli
  • Floriana Esposito
  • Domenico Redavid
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 701)

Abstract

While most newspapers today are born-digital (typeset directly in PDF), until a few years ago they were available only in printed form. Digitizing the paper artifact to make it available in digital libraries yields a sequence of raster images of the pages that make up the document. Such images are just matrices of pixels and carry no explicit information about how their content is organized into meaningful higher-level components. Hence, in order to automatically extract useful information from newspapers and index them for future retrieval, a necessary preliminary task is to identify the layout components that are meaningful to a human reader.

Unfortunately, the approaches proposed in the literature for automatic layout analysis are often ineffective on newspapers, whose layout is much more complex than that of, e.g., books and scientific papers. This work focuses specifically on the classification of layout blocks according to their content type. It investigates how an existing approach, which has been successfully applied to documents with a standard layout, can be adapted to newspapers by revising the description features and the set of classes. The modified approach was implemented and embedded in the DoMInUS system for document processing and management. Experimental results evaluating it are reported and discussed.

Keywords

Layout analysis, Document representation, Document rendering

Notes

Acknowledgments

The authors would like to thank Vincenzo Raimondi for his help in implementing the prototype. This work was partially funded by the Italian PON 2007-2013 project PON02_00563_3489339 ‘Puglia@Service’.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Stefano Ferilli (1)
  • Floriana Esposito (1)
  • Domenico Redavid (2)

  1. University of Bari, Bari, Italy
  2. Artificial Brain S.r.l., Bari, Italy
