Multimedia Tools and Applications, Volume 78, Issue 14, pp 19361–19386

Handwritten multilingual word segmentation using polygonal approximation of digital curves for Indian languages

  • Deepika Gupta
  • Soumen Bag

Abstract

Multilingual Optical Character Recognition (OCR) is difficult to develop because different languages exhibit different writing and structural characteristics, which makes it very hard to generalize their segmentation process. Character segmentation plays an important role in developing OCR for handwritten languages, and its accuracy is an integral factor in OCR performance. In this paper, we address this limitation and propose an approach based on the polygonal approximation of the word that works for more than one Indian language. This work presents a novel approach to script-independent character segmentation of handwritten text that uses basic structural properties of the languages. Digitally straight line segments (DSS) of the word are obtained by applying polygonal approximation to the word. The character segmentation is language independent and also works on skewed words. Experiments are carried out on four popular Indian languages: Hindi, Marathi, Punjabi, and Bangla. The average success rate for character segmentation across the four languages is 90.07%, which compares satisfactorily with existing methods. For character recognition, we use a shadow and cumulative stretch feature set with random forest, support vector machine (SVM), multi-layer perceptron (MLP), and convolutional neural network (CNN) classifiers. Experiments show that the proposed method achieves good accuracy for both character segmentation and recognition.
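
To make the idea of polygonal approximation concrete, the sketch below reduces an ordered list of stroke pixels to a few straight segments. It is an illustration only: it uses the well-known Ramer-Douglas-Peucker algorithm as a stand-in, not the digitally-straight-segment procedure the paper applies, and the function names, the `tolerance` parameter, and the sample stroke are all assumptions introduced here.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        # a and b coincide, so fall back to point-to-point distance.
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def approximate_polygon(curve, tolerance):
    """Ramer-Douglas-Peucker simplification of an ordered point list.

    Hypothetical helper: keeps the end points and recursively keeps the
    point farthest from the current chord while it exceeds `tolerance`.
    """
    if len(curve) < 3:
        return list(curve)
    first, last = curve[0], curve[-1]
    worst_index, worst_distance = 0, 0.0
    for i in range(1, len(curve) - 1):
        d = perpendicular_distance(curve[i], first, last)
        if d > worst_distance:
            worst_index, worst_distance = i, d
    if worst_distance <= tolerance:
        # Every interior point is close enough to the chord.
        return [first, last]
    left = approximate_polygon(curve[: worst_index + 1], tolerance)
    right = approximate_polygon(curve[worst_index:], tolerance)
    # The split point ends `left` and starts `right`; keep it once.
    return left[:-1] + right


# Assumed sample: an L-shaped stroke traced pixel by pixel.
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
vertices = approximate_polygon(stroke, tolerance=0.5)
```

Consecutive pairs in `vertices` are the straight segments of the approximation. A segmentation stage could then examine segment directions and junctions, although how the paper derives cut points from its segments is not stated in the abstract.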

Keywords

Character segmentation; Deep learning; Handwritten Indian languages; Multilingual OCR; Script independent

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India
