Abstract
Defining standard frameworks for performance evaluation is key to advancing the state of the art in any field of document analysis, since it permits a fair and objective comparison of different proposed methods under a common scenario. For that reason, a large number of public datasets have emerged in recent years. However, several challenges must be addressed when creating such datasets in order to obtain a sufficiently large collection of representative data that researchers can easily exploit. In this chapter we review different approaches followed by the document analysis community to address some of these challenges, such as the collection of representative data, its annotation with ground-truth information, and its representation in accepted, common formats. We also provide a comprehensive list of existing public datasets for each of the different areas of document analysis.
© 2014 Springer-Verlag London
About this entry
Cite this entry
Valveny, E. (2014). Datasets and Annotations for Document Analysis and Recognition. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_32
DOI: https://doi.org/10.1007/978-0-85729-859-1_32
Publisher Name: Springer, London
Print ISBN: 978-0-85729-858-4
Online ISBN: 978-0-85729-859-1
eBook Packages: Computer Science, Reference Module Computer Science and Engineering