Handwritten multilingual word segmentation using polygonal approximation of digital curves for Indian languages

Abstract

Multilingual optical character recognition (OCR) is difficult to develop because different languages exhibit different writing and structural characteristics, which makes it hard to generalize their segmentation process. Character segmentation plays an important role in developing OCR for handwritten languages, and its accuracy is a decisive factor in overall OCR performance. In this paper, we address this limitation and propose an approach, based on polygonal approximation of the word, that works on more than one Indian language. This work presents a novel approach for script-independent character segmentation of handwritten text that uses basic structural properties of the languages. Digitally straight line segments (DSS) of the word are obtained by applying polygonal approximation to the word. The character segmentation is language independent and also works with skewed words. Experiments are carried out with four popular Indian languages: Hindi, Marathi, Punjabi, and Bangla. The average success rate for character segmentation across the four languages is 90.07%, which is satisfactory compared with other existing methods. For character recognition, we use the shadow and cumulative stretch feature set with random forest, support vector machine (SVM), multi-layer perceptron (MLP), and convolutional neural network (CNN) classifiers. In our experiments, the proposed method provided good accuracy for both character segmentation and recognition.
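
To make the idea of polygonal approximation concrete, the following is a minimal sketch of how a traced stroke can be reduced to a few straight segments. The abstract does not describe the approximation algorithm the paper uses, so this sketch substitutes the well-known Ramer-Douglas-Peucker method as a stand-in. The function names, the `tolerance` parameter, and the sample stroke are all illustrative assumptions, not taken from the paper.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_polygon(points, tolerance):
    """Ramer-Douglas-Peucker simplification (a stand-in, not the paper's method).

    points: ordered list of (x, y) tuples along a traced curve.
    tolerance: hypothetical maximum allowed deviation, in pixels.
    Returns the retained vertices; consecutive pairs are straight segments.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst_dist:
            worst_index, worst_dist = i, d
    if worst_dist <= tolerance:
        # Every intermediate point is close enough to one straight segment.
        return [first, last]
    # Otherwise split at the farthest point and simplify both halves.
    left = approximate_polygon(points[: worst_index + 1], tolerance)
    right = approximate_polygon(points[worst_index:], tolerance)
    return left[:-1] + right


# Illustrative L-shaped stroke: a horizontal run followed by a vertical run.
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(approximate_polygon(stroke, 0.5))  # [(0, 0), (3, 0), (3, 3)]
```

In this sketch the stroke collapses to two straight segments that meet at the corner vertex. How the paper characterizes such segments and vertices to locate character boundaries is not stated in the abstract.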

Corresponding author

Correspondence to Deepika Gupta.

Cite this article

Gupta, D., Bag, S. Handwritten multilingual word segmentation using polygonal approximation of digital curves for Indian languages. Multimed Tools Appl 78, 19361–19386 (2019). https://doi.org/10.1007/s11042-019-7286-0

Keywords

  • Character segmentation
  • Deep learning
  • Handwritten
  • Indian languages
  • Multilingual
  • OCR
  • Script independent