Data Collection and Annotation for Arabic Document Analysis

  • Ilya Zavorin
  • Eugene Borovikov


Creating high-quality document corpora is not a trivial task, but such corpora are essential for advancing OCR technology. Arabic documents present their own challenges to this process. Here we describe our data creation and annotation efforts for Arabic document analysis. The resulting corpora include both on-line and off-line handwritten data as well as logos, signatures, and mixed-script machine-printed text. All of these are described in detail, with typical example documents.





The authors would like to express their thanks to David Doermann (University of Maryland) for providing an Arabic handwriting ground truth creation tool, Volker Märgner (TU Braunschweig) for providing access to the IFN database, Leila Saidi and Anna Borovikov (both CACI) for helping with Arabic document ground truth creation and annotation, and Kristen Summers (CACI) for providing technical expertise and leadership in the area of document understanding in general and in Arabic OCR in particular. The authors also express their thanks to Luis Hernandez of the U.S. Army Research Laboratory for his support of various efforts described here. Part of the research reported in this document was supported by the U.S. Army Research Laboratory. The views and conclusions contained in this document are those of the authors and should not be interpreted as presenting the official policies or position, whether expressed or implied, of the U.S. Army Research Laboratory or the U.S. Government unless so designated by other authorized documents. Citation of manufacturer or trade names does not constitute an official endorsement or approval of the use thereof. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.



Copyright information

© Springer-Verlag London 2012

Authors and Affiliations

  1. Knowledge and Information Management Division, CACI, Lanham, USA
