Recognition of the Logical Structure of Arabic Newspaper Pages

  • Hassina Bouressace
  • Janos Csirik
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)

Abstract

Document analysis and recognition applies methods of automatic document identification; its main goal is to go from a simple image to a structured set of information that a machine can exploit. Here, we present a system for recognizing the logical structure (hierarchical organization) of Arabic newspaper pages. Such pages have a rich and variable structure: a page may contain several articles, each composed of titles, figures, authors' names and figure captions. Recognizing the logical structure of a newspaper page requires first extracting its physical structure. Our system performs this extraction with a combined method based essentially on the RLSA (Run Length Smearing/Smoothing Algorithm) [1], projection profile analysis, and connected component labeling. Logical structure extraction is then performed using rules on the sizes and positions of the physical elements extracted earlier, together with a priori knowledge of certain properties of the logical entities (titles, figures, authors, captions, etc.). Lastly, the hierarchical organization of the document is represented as an automatically generated XML file. To evaluate the performance of our system, we tested it on a set of images, and the results are encouraging.
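As an illustration of the smearing step named in the abstract, the following is a minimal sketch of RLSA in plain Python. It is not the authors' implementation. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background, and it fills only background runs that are bounded by ink on both sides, which is one common reading of the algorithm. The function names `smear_row` and `rlsa` and both threshold parameters are hypothetical. Combining the horizontal and vertical results with a logical AND follows Wong et al. [1]; the extra horizontal smoothing pass they apply afterwards is omitted here.

```python
def smear_row(row, threshold):
    """Fill runs of 0s of length <= threshold that lie between two 1s.

    Assumption: runs touching the start or end of the row are left alone.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smear rows and columns independently, then AND the two results."""
    horizontal = [smear_row(row, h_threshold) for row in image]

    # Transpose so that columns can be smeared with the same row routine.
    columns = [list(col) for col in zip(*image)]
    smeared_columns = [smear_row(col, v_threshold) for col in columns]
    vertical = [list(row) for row in zip(*smeared_columns)]

    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 1, 0, 0, 0, 1],
    ]
    for line in rlsa(page, h_threshold=2, v_threshold=1):
        print(line)
```

In a pipeline of the kind the abstract describes, the smeared image would typically be passed on to connected component labeling so that each merged blob becomes a candidate block; suitable thresholds depend on the scan resolution and are not given in the source.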

Keywords

Arabic language · Document recognition · Physical structure · Logical structure · Document processing · Segmentation

References

  1. Wong, K.Y., Casey, R.G., Wahl, F.M.: Document analysis system. IBM J. Res. Dev. 26, 647–656 (1982)
  2. Gatos, B., Mantzaris, S.L., Antonacopoulos, A.: First international newspaper segmentation contest. In: Proceedings of the 6th International Conference on Document Analysis and Recognition, pp. 1190–1194 (2001)
  3. Liu, F., Luo, Y., Yoshikawa, M., Hu, D.: A new component based algorithm for newspaper layout analysis. In: Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR), pp. 1176–1179. IEEE Computer Society (2001)
  4. Jain, A.K., Yu, B.: Document representation and its application to page decomposition. IEEE Trans. Pattern Anal. Mach. Intell. J. 20, 294–308 (1998)
  5. Mitchell, P.E., Yan, H.: Newspaper document analysis featuring connected line segmentation. In: 6th International Conference on Document Analysis and Recognition, pp. 1181–1185 (2001)
  6. Hadjar, K., Ingold, R.: Arabic newspaper page segmentation. In: 7th International Conference on Document Analysis and Recognition, pp. 895–899 (2003)
  7. Antonacopoulos, C., Clausner, C., Papadopoulos, S., Pletschacher, S.: Historical document layout analysis competition. In: Proceedings of the 11th International Conference on Document Analysis and Recognition, pp. 1516–1520 (2011)
  8. Antonacopoulos, A., Pletschacher, S., Bridson, D., Papadopoulos, C.: ICDAR 2009 page segmentation competition. In: 10th International Conference on Document Analysis and Recognition, University of Salford, pp. 1370–1374 (2009)
  9. Robadey, L.: 2 (CREM): Une méthode de reconnaissance structurelle de documents complexes basée sur des patterns bidimensionnels, Doctoral thesis, University of Fribourg, Switzerland (2001)
  10. Hadjar, K., Hitz, O., Ingold, R.: Newspaper page decomposition using a split and merge approach. In: 6th International Conference on Document Analysis and Recognition (ICDAR), pp. 1186–1189 (2001)
  11. Palfray, T., Hébert, D., Tranouez, P., Nicolas, S., Paquet, T.: Segmentation logique d’images de journaux anciens, Francophone International Conference on Writing and Document, p. 317 (2012)
  12. Boufersaoui, H., Frihi, I.: Extraction of the logical structure of documents, Master’s thesis of Media Engineering, University, 08 May 1945-Guelma (2015)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Szeged, Szeged, Hungary
