Improved Zoning and Cropping Techniques Facilitating Segmentation
With the advent of digital computers and the shift of the workforce towards robotic processes, Optical Character Recognition (OCR) has immense potential to ease some of these processes. Segmentation is one of the pre-processing phases and is pivotal to the process, since scripts and their characteristics vary widely across languages. This paper focuses on techniques that facilitate segmentation of offline handwritten words in Devanagari script (Hindi), namely Headline detection in handwritten Hindi word images for extracting upper and middle zone characters, and cropping. Experiments are performed on the ICDAR handwritten legal amount words database, with 106 words by 80 writers, and on a self-created touching-character database, with 106 words by 15 writers. The proposed zoning technique, CPT (Continuous Pixel Technique), and the cropping technique are evaluated on 10070 and 530 legal amount words, with accuracies of 98.89% and 80.94% respectively.
Keywords: Handwritten data, Optical character recognition (OCR), Segmentation, Zoning
The use of mother tongues has grown in prominence over the past half decade, which has opened opportunities for tools that provide OCR functionality. Recognizing characters of native scripts nevertheless remains a major challenge that has persisted for more than a decade. The phases of optical character recognition explored in this research are pre-processing and segmentation, feature extraction, and recognition. The recognition accuracy of characters can be enhanced by applying improved pre-processing and segmentation techniques so that appropriate features are extracted.
Moreover, recognition of handwritten words is a complex activity, and achieving accuracy for multilingual scripts is more challenging still. Offline handwritten data adds further complication and yields lower recognition accuracy than online data because of variation in stroke width, writing style, pen or pencil used, mood of the writer, paper used, and skew.
According to the literature survey, improper segmentation of handwritten data is a major research topic. S. Kumar discussed various irregularities in Devanagari script. A number of papers deal with the segmentation problem in Roman characters and numerals [3, 4, 5, 6, 7] and in other Indian languages [8, 9, 10], but very few address Devanagari script (Hindi) [11, 12].
Zoning is required to divide words into an upper zone, a middle zone and a lower zone. The literature survey shows that the horizontal profile, the vertical profile and the Hough transform are used for zoning.
The large character set of Devanagari script (Hindi) makes segmentation more complicated. The Hindi character set comprises vowels, consonants, conjuncts, compound characters and modifiers. This paper proposes a zoning technique (CPT) that detects the Headline and performs zoning. The paper is organized as follows. Section 2 gives the basic techniques used in the literature for pre-processing, the formulation of Headline detection for zoning to extract middle and upper zone characters, and cropping. Section 3 introduces the databases used and the experimental results. The conclusion and future scope are given in Sect. 4.
Character segmentation decomposes an image containing a sequence of characters into sub-images or sub-units for recognition. Pre-processing consists of scanning the handwritten document and subjecting it to binarization, noise removal, skew correction, smoothing, Headline detection or removal, and zoning, all of which aid segmentation. Variations in handwritten data make it necessary to address various intricacies of handwritten text.
Detection of the Headline is a challenging task in handwritten text. Soumen et al. used thinning and the global maximum density of a row to remove the Headline, with an accuracy of 95.45% on 11550 words. Detection of the header line after straightening has also been proposed. Garg et al. used two-stripe projection for Headline detection. A structural approach based on contour tracing detects the Headline by finding the row with the maximum number of pixels. Detecting the Headline as the row with maximum pixel density is widely used but fails for text with variable skew.
In the literature, the Hough transform is used for skew detection and line segmentation. This paper proposes the Continuous Pixel Technique (CPT), based on the Hough transform function, for Headline detection, which facilitates zoning. The proposed algorithm eliminates the problem caused by upper modifiers of unusual size.
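The maximum-density rule mentioned above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background; the function name `headline_row_by_density` and both toy images are hypothetical and are not taken from any of the cited works.

```python
def headline_row_by_density(image):
    """Return the index of the row with the most foreground (1) pixels."""
    row_counts = [sum(row) for row in image]
    # max() returns the first row index that reaches the highest count
    return max(range(len(row_counts)), key=row_counts.__getitem__)


# Toy word with a straight Headline in row 1
straight = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]

# Toy word whose Headline is skewed across rows 1 and 2,
# so a dense body row (row 3) wins instead
skewed = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
]

straight_row = headline_row_by_density(straight)
skewed_row = headline_row_by_density(skewed)
```

In the skewed toy image the Headline pixels are shared between two rows, so neither row holds the maximum count. This is the failure mode noted for skewed text.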
Formal definition of Line detection is given as:
The function returns an accumulator array based on a threshold value, which determines the minimum number of pixels that must belong to a line in image space. θ defines the angle of the detected lines in the polar coordinate system. The set of all straight lines passing through a single point in the plane corresponds to a sinusoidal curve that is unique to that point. Line detection and zoning using CPT (Continuous Pixels Technique) are described in Sect. 2.1.
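The formula for the definition is not reproduced in this text. As a reminder of the standard normal-form parametrization on which Hough line detection is generally built (an assumption about what the definition refers to, not the authors' exact formula), a point (x, y) in image space maps to the sinusoid

```latex
\rho = x\cos\theta + y\sin\theta
```

where ρ is the perpendicular distance of the line from the origin and θ is the angle of its normal. Each foreground pixel votes for every (ρ, θ) cell on its sinusoid, and cells whose vote count reaches the threshold are reported as lines. The symbol ρ is introduced here for illustration only.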
2.1 Zoning - CPT (Continuous Pixels Technique)
Devanagari script consists of core characters in the middle strip and optional modifiers above and below them. Characters form a word when they are joined by a Headline ('shirorekha'). The purpose of zoning is to extract the middle, upper and lower components. The proposed zoning technique is applied to the binarized, smoothed image.
Step 1. The contiguous run of pixels with maximum length is found using the Hough line detection algorithm.
Step 2. A threshold value of 20 is used for finding the Hough lines.
Step 3. Steps 1 and 2 together yield the row common to the upper and the middle zone of the word.
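The idea behind these steps can be sketched as follows. This is not the authors' MATLAB implementation: it replaces Hough line detection with a plain run-length scan over rows, which covers only perfectly horizontal lines. The names `longest_run`, `detect_headline`, `split_zones` and `min_run` are hypothetical; `min_run` mirrors the threshold of 20, and assigning the Headline row to the middle zone is also an assumption.

```python
def longest_run(row):
    """Length of the longest run of consecutive foreground (1) pixels."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


def detect_headline(image, min_run=20):
    """Return the row holding the longest contiguous run, or None.

    Rows whose longest run is shorter than min_run are ignored.
    """
    best_row, best_len = None, min_run - 1
    for r, row in enumerate(image):
        run = longest_run(row)
        if run > best_len:
            best_row, best_len = r, run
    return best_row


def split_zones(image, headline_row):
    """Split into (upper zone, middle zone) with the Headline row in the latter."""
    return image[:headline_row], image[headline_row:]


# Toy word image, 6 rows by 30 columns
width = 30
word = [[0] * width for _ in range(6)]
word[0][10:15] = [1] * 5                    # upper modifier stroke
word[2] = [0] * 4 + [1] * 22 + [0] * 4      # Headline: 22 contiguous pixels
word[4] = [1, 1, 1, 1, 0] * 6               # 24 pixels, but broken into runs of 4

headline = detect_headline(word)
upper, middle = split_zones(word, headline)
```

Row 4 of the toy image contains more ink than the Headline row, so a pure density rule would pick it. The continuity criterion selects row 2 because only that row has an unbroken run of at least 20 pixels.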
The presence of a Headline in languages such as Hindi, Gurumukhi and Marathi makes segmentation more difficult than in scripts without a Headline, such as Roman script. To segment the constituent characters of a word, the literature reports techniques that remove the Headline and others that do not [16, 17].
The literature discusses various Headline removal techniques for printed text [18, 19] and for online handwritten text, based on the number of pixels present per stroke. The same task becomes difficult for offline handwritten text. Handwritten text varies between writers in stroke width, writing style, pen or pencil used, mood of the writer, paper used, skew and so on. A varying Headline adds further complexity.
This paper crops characters without removing the Headline. One reason is that even an individual Hindi character is written with a Headline on it; the other is that the available handwritten word database consists of characters with a Headline. Segmentation approaches that remove the Headline must either create a database of characters without a Headline or add the Headline back to each character after segmentation. Removing and then re-adding the Headline increases computation and would therefore reduce the overall efficiency of the algorithm.
2.2.1 Cropping Algorithm
The lower sub-part starts at row rzone, which is calculated by adding a threshold value to the maximum row index of the header line.
- Step 1:
The lower sub-part imol is cropped from the input image imi, and connected components are extracted from imol
[r c] = size(imi)
imol = imi(rzone:r, 1:c)
- Step 2:
For each connected component i = 1 to N
[rc cc] = size(i)
imo = crop_img(1:rc, cc)
- Step 3:
Save imo[i] for 1 ≤ i ≤ N
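One plausible reading of the steps above can be sketched as follows. The sketch makes three assumptions that the pseudocode does not state: components are 8-connected, each crop keeps the full image height so that the Headline segment stays attached to the character, and crops are returned in left-to-right order. The names `connected_components` and `crop_characters` are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:                     # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and image[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(pixels)
    return components


def crop_characters(image, rzone):
    """Crop one full-height column slice per component found below rzone."""
    lower = image[rzone:]                        # imol = imi(rzone:r, 1:c)
    spans = sorted(
        (min(x for _, x in pixels), max(x for _, x in pixels))
        for pixels in connected_components(lower)
    )
    return [[row[left:right + 1] for row in image] for left, right in spans]


# Toy word: Headline in row 0, two character bodies below it
word = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 1],
]
pieces = crop_characters(word, rzone=1)
```

Here `rzone=1` plays the role of the Headline row plus threshold. Touching characters would form a single component in the lower sub-part and be cropped together.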
3 Experimental Results
The performance of CPT is evaluated and verified manually. The experiments are performed in MATLAB R2009b under Microsoft Windows on an x86-based PC with a 2.40 GHz CPU and 4 GB RAM.
3.1 Database Used for Experiment
The database consists of 8480 handwritten legal amount words with non-touching characters, written by 80 writers and provided by ICDAR. No benchmark database of Hindi words with touching characters is available, so we prepared a dataset of 1590 legal amount words with touching characters, written by 15 writers. The database consists of binary images. The CPT (Continuous Pixels Technique) algorithm is verified manually on all 10070 words, and 530 images randomly selected from the 10070 are used for cropping.
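Taking the 106 words per writer stated in the abstract, the database sizes are mutually consistent:

```latex
\begin{aligned}
106 \times 80 &= 8480 \quad \text{(non-touching words)} \\
106 \times 15 &= 1590 \quad \text{(touching words)} \\
8480 + 1590 &= 10070 \quad \text{(total words)}
\end{aligned}
```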
Accuracy of the proposed algorithms: CPT zoning achieves 98.89% on 10070 words and cropping achieves 80.94% on 530 words.
4 Conclusion and Future Scope
In this paper, a zoning technique, CPT (Continuous Pixels Technique), and a cropping technique are proposed. CPT facilitates division of handwritten Hindi words into upper and middle zones by finding the contiguous pixels in the Headline. Cropping facilitates extraction of the individual components of a Devanagari word image while taking into account one of its major characteristics, the Headline. The accuracy of cropping can be further enhanced by addressing shadowed characters. Future work will focus on extracting individual components of the word image under constraints such as shadowed characters.
I am thankful to Jayadevan R., ICDAR, for support and for providing the database of offline handwritten words in Hindi.
- 1. Jayadevan, R., Kolhe, S.R., Patil, P.M., Pal, U.: Database development and recognition of handwritten Devanagari legal amount words. In: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp. 304–308 (2011)
- 2. Kumar, S.: An analysis of irregularities in Devanagari script writing: a machine recognition perspective. Int. J. Comput. Sci. Eng. 2, 274–279 (2010)
- 3. Choudhary, A., Rishi, R., Ahlawat, S.: New character segmentation approach for off-line cursive handwritten words. Procedia Comput. Sci. 17, 88–95 (2013)
- 4. Elnagar, A., Alhajj, R.: Segmentation of connected handwritten numeral strings. Pattern Recognit. 36, 625–634 (2003)
- 5. Jayarathna, U.K.S., Bandara, G.E.M.D.C.: A junction based segmentation algorithm for offline handwritten connected character segmentation. In: International Conference on Computational Intelligence for Modelling, Control and Automation, 2006 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, p. 147 (2006)
- 6. Kim, K.K., Kim, J.H., Suen, C.Y.: Segmentation-based recognition of handwritten touching pairs of digits using structural features. Pattern Recognit. Lett. 23, 13–24 (2002)
- 7. Saba, T., Sulong, G., Rehman, A.: Non-linear segmentation of touched roman characters based on genetic algorithm. Int. J. Comput. Sci. Eng. 2, 2167–2172 (2010)
- 8. Reddy, L.P., Babu, T.R., Rao, N.V., Babu, B.R.: Touching syllable segmentation using split profile algorithm. Int. J. Comput. Sci. Issues (IJCSI) 7(3), 1–10 (2010)
- 9. Bag, S., Bhowmick, P., Harit, G., Biswas, A.: Character segmentation of handwritten Bangla text by vertex characterization of isothetic covers. In: 2011 Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 21–24 (2011)
- 10. Venkatesh, M., Majjagi, V., Vijayasenan, D.: Implicit segmentation of Kannada characters in offline handwriting recognition using hidden Markov models. arXiv:1410.4341, pp. 1–6 (2014)
- 11. Bag, A.S., Krishna: Character segmentation of Hindi unconstrained handwritten words. In: International Workshop on Combinatorial Image Analysis, vol. 9448, pp. 247–260. Springer, Cham (2015)
- 12. Garg, N.K., Kaur, L., Jindal, M.K.: The hazards in segmentation of handwritten Hindi Text. Int. J. Comput. Appl. 29, 30–34 (2011)
- 13. Palakollu, S., Rani, R.: Handwritten Hindi text segmentation techniques for lines and characters. In: Proceedings of the World Congress on Engineering and Computer Science (2012)
- 14. Garg, N.K.: A new method for line segmentation of handwritten Hindi text key words. In: Seventh International Conference on Information Technology, pp. 392–397 (2010)
- 15. Hanmandlu, M.B.L., Agrawal, P.: Segmentation of handwritten Hindi text: a structural approach. Int. J. Comput. Proc. Languages 22(01), 1–20 (2001)
- 16. Bhujade, M.V.G., Meshram, M.C.M.: A technique for segmentation of handwritten Hindi text. Int. J. Eng. Res. Technol. 3, 1491–1495 (2014)
- 17. Ramteke, A.S., Rane, M.E.: Offline handwritten devanagari script segmentation. Int. J. Sci. Res. 1, 142–145 (2012)
- 18. Garain, U., Chaudhuri, B.B.: Segmentation of touching and fused Devanagari characters. Pattern Recognit. 32, 449–459 (2002)
- 19. Bansal, V., Sinha, R.M.K.: Segmentation of touching and fused Devanagari characters. Pattern Recognit. 35, 875–893 (2002)
- 20. Kumar, M.: Segmentation of isolated and touching characters in offline handwritten Gurmukhi script recognition. Int. J. Inf. Technol. Comput. Sci. 2, 58–63 (2014)