Data Collection and Annotation for Arabic Document Analysis
Creating high-quality document corpora is not a trivial task, but such corpora are essential for advancing OCR technology. Arabic documents present their own challenges to this process. Here we describe our data creation and annotation efforts for Arabic document analysis. The resulting corpora include both on-line and off-line handwritten data as well as logos, signatures, and mixed-script machine-printed text. All of these are described in detail, and typical example documents are given.
Keywords: Ground Truth, Document Image, Handwriting Recognition, Test Corpus, Image Processing Task
The authors would like to express their thanks to David Doermann (University of Maryland) for providing an Arabic handwriting ground truth creation tool, Volker Märgner (TU Braunschweig) for providing access to the IFN database, Leila Saidi and Anna Borovikov (both CACI) for helping with Arabic document ground truth creation and annotation, and Kristen Summers (CACI) for providing technical expertise and leadership in the area of document understanding in general and in Arabic OCR in particular. The authors also express their thanks to Luis Hernandez of the U.S. Army Research Laboratory for his support of various efforts described here. Part of the research reported in this document was supported by the U.S. Army Research Laboratory. The views and conclusions contained in this document are those of the authors and should not be interpreted as presenting the official policies or position, whether expressed or implied, of the U.S. Army Research Laboratory or the U.S. Government unless so designated by other authorized documents. Citation of manufacturer or trade names does not constitute an official endorsement or approval of the use thereof. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.