Multimedia Tools and Applications, Volume 77, Issue 7, pp 8441–8473

Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images

  • Pawan Kumar Singh
  • Ram Sarkar
  • Nibaran Das
  • Subhadip Basu
  • Mahantapas Kundu
  • Mita Nasipuri

Abstract

A handwritten document image dataset is a basic necessity for research on Optical Character Recognition (OCR) systems. In a multilingual country like India, handwritten documents often contain more than one script, which leads to complex pattern analysis problems. In this paper, we highlight two such situations, in which Devanagari and Bangla, two of the most widely used scripts in the Indian subcontinent, are each used alongside Roman script in documents. We address three key challenges: 1) collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages; 2) script-level annotation of 18931 Bangla words, 15528 Devanagari words and 10331 Roman words in those 300 document pages; and 3) development of a bi-script and tri-script word-level script identification module that uses a Modified log-Gabor filter as feature extractor. The technique is statistically validated using multiple classifiers, and the Multi-Layer Perceptron (MLP) classifier is found to perform best. Average word-level script identification accuracies of 92.32%, 95.30% and 93.78% are achieved using 3-fold cross-validation for the Bangla-Roman, Devanagari-Roman and Bangla-Devanagari-Roman databases respectively. Both mixed-script document databases, along with the script-level annotations and the 44790 extracted word images of the three scripts, are freely available at https://code.google.com/p/cmaterdb/.
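As a rough illustration of the kind of filter named in the abstract, the sketch below evaluates the radial frequency response of the standard log-Gabor function. It is not the Modified log-Gabor filter of the paper, whose details are not given here. The function names, the bandwidth ratio of 0.55, and the number and spacing of scales are all assumptions chosen for illustration only.

```python
import math


def log_gabor_radial(f, f0, sigma_ratio=0.55):
    """Radial response of a standard log-Gabor filter at frequency f.

    f0 is the centre frequency; sigma_ratio (assumed value) sets the
    bandwidth. This is the textbook form, not the paper's modified filter.
    """
    if f <= 0.0:
        return 0.0  # a log-Gabor filter has no DC component
    num = math.log(f / f0) ** 2
    den = 2.0 * math.log(sigma_ratio) ** 2
    return math.exp(-num / den)


def filter_bank_responses(f, n_scales=5, f_max=0.25, scale_step=2.0):
    """Responses at frequency f for a hypothetical bank of n_scales filters.

    Centre frequencies start at f_max and are divided by scale_step per scale.
    """
    centres = [f_max / (scale_step ** s) for s in range(n_scales)]
    return [log_gabor_radial(f, f0) for f0 in centres]
```

The response peaks at the centre frequency, is zero at zero frequency, and is symmetric on a logarithmic frequency axis, which is why such filters are commonly used for texture-like features. In a script identification pipeline the per-scale, per-orientation filter energies of a word image would typically form the feature vector passed to a classifier.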

Keywords

Script identification · Handwritten text · Mixed-script documents · Optical character recognition · Modified log-Gabor filter Transform · Statistical significance tests

Notes

Acknowledgements

The authors are thankful to CMATER and the Project on Storage Retrieval and Understanding of Video for Multimedia (SRUVM) of the Computer Science and Engineering Department, Jadavpur University, for providing infrastructure facilities during the course of this work. The work reported here has been partially funded by University with Potential for Excellence (UPE), Phase-II, UGC, Government of India. Many people also helped to make the database fit for use; the authors are grateful to everyone who contributed data to this project.

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

All six authors are with the Department of Computer Science and Engineering, Jadavpur University, Kolkata, India.
