
Improved Zoning and Cropping Techniques Facilitating Segmentation

  • Monika Kohli
  • Satish Kumar
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 955)

Abstract

With the spread of digital computers and the shift of manual work towards automated processes, Optical Character Recognition (OCR) has considerable potential to ease some of these processes. Segmentation is one of the pre-processing phases and is pivotal to the overall process, because scripts and their characteristics vary widely. This paper focuses on techniques that facilitate segmentation of offline handwritten words in Devanagari script (Hindi): Headline detection in handwritten Hindi word images, used to extract upper and middle zone characters, and cropping. Experiments are performed on the ICDAR database of handwritten legal amount words [1] (106 words by 80 writers) and on a self-created touching-character database (106 words by 15 writers). The proposed zoning technique, CPT (Continuous Pixel Technique), and the cropping technique are applied to 10070 and 530 legal amount words, with accuracies of 98.89% and 80.94% respectively.

Keywords

Handwritten data · Optical character recognition (OCR) · Segmentation · Zoning

1 Introduction

The use of native languages has grown in prominence over the past half decade, which has opened opportunities for tools that provide OCR functionality. Recognizing the characters of native scripts is a major challenge that has persisted for more than a decade. Research on optical character recognition covers the phases of pre-processing and segmentation, feature extraction, and recognition. Character recognition accuracy can be enhanced by applying improved pre-processing and segmentation techniques so that appropriate features are extracted.

Moreover, recognition of handwritten words is a complex activity, and achieving accuracy for multilingual scripts is more challenging still. Offline handwritten data adds further complication and reduces recognition accuracy compared to online data, owing to variation in stroke width, writing style, pen or pencil used, mood of the writer, paper used, and skew.

According to the literature survey, improper segmentation of handwritten data is a major research topic. S. Kumar [2] discussed various irregularities in Devanagari script. A number of papers deal with the segmentation problem in Roman characters and numerals [3, 4, 5, 6, 7] and in other Indian languages [8, 9, 10], but very few address Devanagari script (Hindi) [11, 12].

Zoning divides a word into an upper zone, a middle zone and a lower zone. The literature survey shows that horizontal profiles, vertical profiles and the Hough transform are used for zoning.

The large character set of Devanagari script (Hindi) makes segmentation more complicated. The Hindi character set comprises vowels, consonants, conjuncts, compound characters and modifiers. This paper proposes a zoning technique (CPT) that detects the Headline and performs zoning, together with a cropping technique. The paper is organized as follows. Section 2 describes the basic pre-processing techniques used in the literature, the formulation of Headline detection for zoning to extract middle and upper zone characters, and cropping. Section 3 introduces the databases used and the experimental results. The conclusion and future scope are given in Sect. 4.

2 Pre-processing

Character segmentation decomposes an image containing a sequence of characters into sub-images or sub-units for recognition. Pre-processing consists of scanning the handwritten document and then applying binarization, noise removal, skew correction, smoothing, Headline detection or removal, and zoning, all of which aid segmentation. Variations in handwritten data make it necessary to address various intricacies of handwritten text.

Detection of the Headline is a challenging task in handwritten text. Soumen et al. [11] used thinning and the global maximum density of a row to remove the Headline, with an accuracy of 95.45% on 11550 words. In [13], header line detection is performed after straightening. Garg et al. [14] used two-stripe projection for Headline detection. A structural approach based on contour tracing is used in [15], which detects the Headline by finding the row with the maximum number of pixels. Detecting the Headline as the row with maximum pixel density is widely used, but it fails for text with variable skew.

In the literature, the Hough transform is used for skew detection and line segmentation. This paper proposes the Continuous Pixel Technique (CPT), based on a Hough transform function, for Headline detection, which facilitates zoning. The proposed algorithm eliminates the problem caused by unusually sized upper modifiers.

The formal definition of line detection is as follows.

The function uses the polar (normal) representation of lines to represent lines in a binary image:
$$ \rho = x\cos\theta + y\sin\theta $$

Here ρ is the perpendicular distance from the origin to the line and θ is the angle of the detected line's normal in the polar coordinate system. The function returns an accumulator array; a threshold on the accumulator determines the minimum number of pixels that must lie on a line in image space for the line to be reported. The set of all straight lines passing through a single point in the plane corresponds to a sinusoidal curve in (ρ, θ) space that is unique to that point. Line detection and zoning using CPT (Continuous Pixels Technique) are described in Sect. 2.1.
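
To make the voting step concrete, here is a minimal pure-Python sketch of a Hough accumulator using the standard normal form ρ = x cos θ + y sin θ. It is not the MATLAB routine used in the experiments; the function names `hough_accumulator` and `peaks`, the 1-degree angular step, and the rounding of ρ to integer bins are all illustrative assumptions.

```python
import math

def hough_accumulator(image, theta_step_deg=1):
    """Vote in (rho, theta) space for every foreground pixel of a binary image.

    image: list of rows, each a list of 0/1 values (1 = foreground).
    Returns a dict mapping (rho, theta_deg) -> number of votes.
    """
    votes = {}
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            # Each foreground pixel votes once per sampled angle,
            # tracing out its sinusoid in (rho, theta) space.
            for theta_deg in range(0, 180, theta_step_deg):
                theta = math.radians(theta_deg)
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                key = (rho, theta_deg)
                votes[key] = votes.get(key, 0) + 1
    return votes

def peaks(votes, threshold):
    """Keep only the (rho, theta) cells that collected at least `threshold` votes."""
    return {cell: n for cell, n in votes.items() if n >= threshold}
```

In this parameterisation a horizontal stroke on image row y collects its votes in the cell (ρ = y, θ = 90°), which is why a long Headline shows up as a strong peak near θ = 90°.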

2.1 Zoning - CPT (Continuous Pixels Technique)

Devanagari script consists of core characters in the middle strip and optional modifiers above and below the core characters. Characters form a word when they are joined by a Headline ('shirorekha'). The purpose of zoning is to extract the middle, upper and lower components. The proposed zoning technique is applied to a binarized, smoothed image.

Step 1. The contiguous run of pixels with maximum length is found using the Hough line detection algorithm.

Step 2. A threshold value of 20 is used for finding Hough lines.

Step 3. Steps 1 and 2 yield the row shared by the upper and middle zones of the word.

Step 4. The upper zone is retained for further processing if it contains a matra, matras, Chandra bindu or bindu; otherwise it is discarded, based on the number of rows of white pixels in the upper zone. Figures 1a, b and c show the result of the applied algorithm.
Fig. 1. (a) Word images with a matra in the upper zone, (b) word images without a matra in the upper zone, (c) word images with a Chandra bindu in the upper zone
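
The zoning steps can be sketched in Python as follows. This is a simplified stand-in, not the authors' implementation: it replaces the Hough line call with a direct scan for the longest horizontal run of foreground pixels, assumes an unskewed word image, and uses hypothetical names (`longest_run_row`, `split_zones`, `min_run`, `min_ink_rows`). The default of 20 mirrors the threshold quoted in the steps; the exact rule for discarding an empty upper zone is not given in the text, so counting rows that contain ink is an assumption.

```python
def longest_run_row(image):
    """Return (row_index, run_length) of the longest horizontal run of 1s.

    image: list of rows, each a list of 0/1 values (1 = foreground).
    Ties are resolved in favour of the topmost row.
    """
    best_row, best_len = -1, 0
    for r, row in enumerate(image):
        run = 0
        for pixel in row:
            run = run + 1 if pixel else 0
            if run > best_len:
                best_row, best_len = r, run
    return best_row, best_len

def split_zones(image, min_run=20, min_ink_rows=1):
    """Split a binary word image at the detected headline row.

    Returns (upper, middle). `upper` is None when no headline is found or
    when fewer than `min_ink_rows` rows above the headline contain ink.
    """
    row, length = longest_run_row(image)
    if length < min_run:
        return None, image          # no headline detected: leave image whole
    upper, middle = image[:row], image[row:]
    ink_rows = sum(1 for r in upper if any(r))
    if ink_rows < min_ink_rows:
        upper = None                # blank upper zone is discarded
    return upper, middle
```

The middle zone returned here still starts at the headline row, which matches the paper's choice of keeping the Headline on the characters.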

2.2 Cropping

The presence of a Headline in languages such as Hindi, Gurumukhi and Marathi makes segmentation more difficult than in scripts without a Headline, such as Roman script. To segment the constituent characters of a word, the literature reports techniques that remove the Headline and others that do not [16, 17].

The literature discusses various Headline removal techniques for printed text [18, 19] and online handwritten text [20], based on the number of pixels present per stroke. The same task becomes difficult when the text is offline handwritten. Variability in handwritten text can be due to varying stroke width, writing style, pen or pencil used, mood of the writer, paper used, skew, etc. A varying Headline adds further complexity.

This paper crops characters without removing the Headline. One reason is that even an individual Hindi character is written with a Headline on it; the other is that the available handwritten word database consists of characters with a Headline. Segmentation approaches that remove the Headline need either a database of characters without a Headline or a step that adds the Headline back to each character after segmentation. Removing and then re-adding the Headline increases computation and hence reduces overall efficiency.

In this paper, a new idea is proposed for extracting the individual components of a Devanagari word image. The Headline is ignored while finding the components but is included when the components are cropped in the middle zone. Initially the idea is implemented with a fixed threshold: the top 10% of the image rows are excluded when finding connected components and included when cropping. Because stroke width varies in handwritten data, the results are further improved by calculating the value rzone (the maximum row value corresponding to the header line plus a threshold value). This approach has the advantage that existing character databases, which consist of characters with a Headline, can be used without removing the Headline. The result of the proposed algorithm is shown in Fig. 2.
Fig. 2. Zoning and cropping on words containing touching characters

2.2.1 Cropping Algorithm

In handwritten data, the width of the stroke is not fixed. The pixels contributing to the Headline also vary in length as well as in width. In the proposed algorithm for cropping individual components, only the lower sub-part of the middle zone is considered. The upper sub-part of the middle zone is not considered while finding connected components in the image.

The starting row of the lower sub-part (rzone) is calculated by adding the threshold value to the maximum row value corresponding to the header line.

Step 1: Take the lower sub-part imol of the input image imi and extract the connected components from imol.

    [r, c] = size(imi)
    imol = imi(rzone:r, 1:c)

Step 2: For each connected component i = 1 to N, crop the component.

    for i = 1 to N
        [rc, cc] = size(i)
        imo[i] = crop_img(1:rc, cc)
    end

Step 3: Save imo[i] for 1 ≤ i ≤ N.
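
A runnable Python sketch of the cropping idea is given here for illustration. It is one possible reading of the pseudocode rather than the authors' MATLAB code: connected components are found only in rows from `rzone` downwards, and each component's column span is then cropped over the full image height so that the Headline stays on the character. The use of 8-connectivity, cropping by column span, 0-based row indices, and the names `component_column_spans` and `crop_components` are assumptions.

```python
from collections import deque

def component_column_spans(image, rzone):
    """Find 8-connected components in rows >= rzone.

    image: list of rows of 0/1 values (1 = foreground).
    Returns a sorted list of (c_min, c_max) column spans, one per component.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    spans = []
    for r0 in range(rzone, rows):
        for c0 in range(cols):
            if not image[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill restricted to rows below the headline.
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            c_min = c_max = c0
            while queue:
                r, c = queue.popleft()
                c_min, c_max = min(c_min, c), max(c_max, c)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (rzone <= nr < rows and 0 <= nc < cols
                                and image[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            spans.append((c_min, c_max))
    return sorted(spans)

def crop_components(image, rzone):
    """Crop each component over the full image height, so the headline is kept."""
    return [[row[c_min:c_max + 1] for row in image]
            for c_min, c_max in component_column_spans(image, rzone)]
```

For example, with a one-row headline spanning the image and two separate strokes beneath it, `rzone = 1` yields two crops, whereas `rzone = 0` lets the headline join everything into a single component. This is the reason for excluding the headline rows during the component search.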

3 Experimental Results

The performance of CPT is evaluated and verified manually. The experiments are performed in MATLAB R2009b under Microsoft Windows on an x86-based PC with a 2.40 GHz CPU and 4 GB RAM.

3.1 Database Used for Experiment

The first database consists of 8480 handwritten legal amount words containing non-touching characters, written by 80 writers and provided by ICDAR [1]. No benchmark database of Hindi words with touching characters is available, so we prepared a dataset of 1590 legal amount words with touching characters, written by 15 writers. Both databases consist of binary images. The CPT (Continuous Pixels Technique) algorithm is verified manually on all 10070 words; for cropping, 530 images randomly selected from the 10070 are used.

3.2 Results

Table 1 shows the accuracies obtained on the dataset. An accuracy of 98.89% is obtained using CPT (Continuous Pixels Technique) and 80.94% using cropping.
Table 1. Accuracy of the proposed algorithms.

Algorithm    Words    Correctly detected    Accuracy (%)
CPT          10070    9958                  98.89
Cropping     530      429                   80.94
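
The reported figures can be cross-checked from the counts in Table 1 and the dataset sizes in Sect. 3.1. The short Python script below is only a consistency check; the function and variable names are illustrative.

```python
def accuracy(correct, total):
    """Percentage of correctly processed words, rounded to two decimals."""
    return round(100.0 * correct / total, 2)

icdar_words = 106 * 80        # non-touching words: 106 words x 80 writers
touching_words = 106 * 15     # touching-character words: 106 words x 15 writers
total_words = icdar_words + touching_words

cpt_accuracy = accuracy(9958, total_words)   # CPT row of Table 1
cropping_accuracy = accuracy(429, 530)       # Cropping row of Table 1
```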

4 Conclusion and Future Scope

In this paper, a zoning technique, CPT (Continuous Pixels Technique), and a cropping technique are proposed. CPT separates the upper and middle zones of handwritten Hindi words by finding the contiguous pixels in the Headline. Cropping extracts the individual components of a Devanagari word image while taking into account one of its major characteristics, the Headline. The accuracy of cropping can be further enhanced by addressing shadowed characters. Future work will focus on extracting individual components of the word image under constraints such as shadowed characters.

Notes

Acknowledgment

I am thankful to Jayadevan R., ICDAR, for support and for providing the database of offline handwritten words in Hindi.

References

  1. Jayadevan, R., Kolhe, S.R., Patil, P.M., Pal, U.: Database development and recognition of handwritten Devanagari legal amount words. In: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp. 304–308 (2011)
  2. Kumar, S.: An analysis of irregularities in Devanagari script writing - a machine recognition perspective. Int. J. Comput. Sci. Eng. 2, 274–279 (2010)
  3. Choudhary, A., Rishi, R., Ahlawat, S.: New character segmentation approach for off-line cursive handwritten words. Procedia Comput. Sci. 17, 88–95 (2013)
  4. Elnagar, A., Alhajj, R.: Segmentation of connected handwritten numeral strings. Pattern Recognit. 36, 625–634 (2003)
  5. Jayarathna, U.K.S., Bandara, G.E.M.D.C.: A junction based segmentation algorithm for offline handwritten connected character segmentation. In: International Conference on Computational Intelligence for Modelling, Control and Automation, 2006 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, p. 147 (2006)
  6. Kim, K.K., Kim, J.H., Suen, C.Y.: Segmentation-based recognition of handwritten touching pairs of digits using structural features. Pattern Recognit. Lett. 23, 13–24 (2002)
  7. Saba, T., Sulong, G., Rehman, A.: Non-linear segmentation of touched roman characters based on genetic algorithm. Int. J. Comput. Sci. Eng. 2, 2167–2172 (2010)
  8. Reddy, L.P., Babu, T.R., Rao, N.V., Babu, B.R.: Touching syllable segmentation using split profile algorithm. Int. J. Comput. Sci. Issues (IJCSI) 7(3), 1–10 (2010)
  9. Bag, S., Bhowmick, P., Harit, G., Biswas, A.: Character segmentation of handwritten Bangla text by vertex characterization of isothetic covers. In: 2011 Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 21–24 (2011)
  10. Venkatesh, M., Majjagi, V., Vijayasenan, D.: Implicit segmentation of Kannada characters in offline handwriting recognition using hidden Markov models. arXiv:1410.4341, pp. 1–6 (2014)
  11. Bag, A.S., Krishna: Character segmentation of Hindi unconstrained handwritten words. In: International Workshop on Combinatorial Image Analysis, vol. 9448, pp. 247–260. Springer, Cham (2015)
  12. Garg, N.K., Kaur, L., Jindal, M.K.: The hazards in segmentation of handwritten Hindi Text. Int. J. Comput. Appl. 29, 30–34 (2011)
  13. Palakollu, S., Rani, R.: Handwritten Hindi text segmentation techniques for lines and characters. In: Proceedings of the World Congress on Engineering and Computer Science (2012)
  14. Garg, N.K.: A new method for line segmentation of handwritten Hindi text key words. In: Seventh International Conference on Information Technology, pp. 392–397 (2010)
  15. Hanmandlu, M.B.L., Agrawal, P.: Segmentation of handwritten Hindi text: a structural approach. Int. J. Comput. Proc. Languages 22(01), 1–20 (2001)
  16. Bhujade, M.V.G., Meshram, M.C.M.: A technique for segmentation of handwritten Hindi text. Int. J. Eng. Res. Technol. 3, 1491–1495 (2014)
  17. Ramteke, A.S., Rane, M.E.: Offline handwritten devanagari script segmentation. Int. J. Sci. Res. 1, 142–145 (2012)
  18. Garain, U., Chaudhuri, B.B.: Segmentation of touching and fused Devanagari characters. Pattern Recognit. 32, 449–459 (2002)
  19. Bansal, V., Sinha, R.M.K.: Segmentation of touching and fused Devanagari characters. Pattern Recognit. 35, 875–893 (2002)
  20. Kumar, M.: Segmentation of isolated and touching characters in offline handwritten Gurmukhi script recognition. Int. J. Inf. Technol. Comput. Sci. 2, 58–63 (2014)

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Department of Computer Science and Applications, Panjab University, Chandigarh, India
  2. Department of Computer Applications, Panjab University, SSG Regional Centre, Hoshiarpur, India
