1 Introduction

Preserving documents and their legacy is an important task, and document image processing plays a vital role in it. Document image binarization is commonly performed as a preprocessing step in optical character recognition (OCR) [1, 2] and image search, and it supports handwriting recognition as well as the extraction of logos and pictures from graphical images. The main purposes of document image processing [3] are to reduce paper usage and to provide easy access to documents at the lowest possible storage cost. The most challenging task at this stage is to segment the region of interest (ROI) for further analysis. The simplest method for image segmentation [4] is thresholding-based binarization, which is also an essential technique in image enhancement and biomedical image analysis. The output of this process is a binary image [5]. Although researchers have worked on document image binarization for several years, thresholding of compound document images remains challenging because of its sensitivity to noise, illumination, variable intensity and, in some cases, insufficient contrast. It has been observed that several existing methods [6–10] offer very good results for text documents. However, their performance degrades when a degraded text document contains graphical images. We refer to this type of document as a compound document in the rest of this paper.

In this work, we aim to devise a new segmentation methodology that performs well on compound documents. We separate the entire image into three regions: the background, the text-only region and the graphical image. The proposed method keeps a good balance between text and graphics in degraded compound documents. It is based on cluster density information and consists of six phases: noise removal with image normalization, entropy calculation, fuzzy c-means clustering, segmentation of each region based on the clustering output, local-threshold-based binarization and, finally, integration of the segmented regions. Each of these phases is described in detail in the design methodology section.

2 Survey on Existing Techniques

Document image binarization has drawn a lot of attention in the machine vision research community. Some of the most highly cited methods are summarized in this section. Parker et al. proposed a method based on the Shen–Castan edge detector to identify object pixels [7]; it builds a surface using a moving least squares fit, which is then used to threshold the image. Chen et al. proposed a fast entropic threshold selection algorithm [8] that selects a global threshold value using a maximin optimization procedure. O’Gorman proposed a global approach based on measuring local connectivity information, with threshold values incorporated at the intensity level; the method therefore combines the advantages of local and global adaptive approaches. Liu et al. proposed a method based on grey-scale and run-length histograms that carefully handles noisy and complex backgrounds. Chang et al. worked on stroke connectivity preservation for graphical images [11]; their algorithm eliminates background noise, enhances the grey levels of text, and can extract strokes from low-density as well as dark backgrounds. Shaikh et al. [12] proposed an iterative partitioning method in which the input image is divided into four equal sub-images whenever its histogram contains more than two peaks; the process continues until each sub-image contains no more than two histogram peaks. This binarization method gives good results for very old, faded and stained document images but fails for medical image segmentation. Otsu’s method [13] thresholds the image based on the class variance criterion computed from the histogram of the input image; it segments the image into two classes such that the variance between the classes is maximized. Otsu’s binarization technique produces good results for graphical images; however, it cannot properly binarize old, spotted documents.

3 Design of New Information Density Based Binarization Technique

3.1 Image Acquisition and Enhancement

The design methodology follows a pipelined approach starting with image acquisition and ending with binarization. Histogram-based Otsu’s method may provide satisfactory results when the document images are clean. In practice, however, old and repeatedly photocopied documents are of poor quality, and such documents are often not binarized properly by Otsu’s method.

Otsu’s method is a generalized, histogram-based binarization; it performs no information-density-based segmentation. The algorithm assumes that the input image contains foreground and background pixels and then calculates the optimal threshold that separates the two regions. The USC-SIPI database [14] is used for the testing phase; the quality and size of the original images are not changed. In addition to this database, we have also tested the proposed method on sample scanned document images that contain both text and graphical images (Fig. 1).

Fig. 1. Block diagram of the proposed method

Image enhancement improves the visual quality of the input image for the subsequent processing steps. The quality of some of the input images is poor: most of them contain speckle noise, salt-and-pepper noise and some random noise. A Wiener filter is used here to remove this mixture of noise. The Wiener filter is a linear filter that removes additive noise and deblurs the image, and it performs well in reducing the mean squared error. It is usually applied in the frequency domain, where the output image \(W(f_1, f_2)\) is given by

$$\begin{aligned} W(f_1, f_2)=\frac{H^{*}(f_1, f_2)\, S_x(f_1, f_2)}{|H(f_1, f_2)|^{2}\, S_x(f_1, f_2)+S_y(f_1, f_2)} \end{aligned}$$
(1)

Here, \(S_x(f_1, f_2)\) and \(S_y(f_1, f_2)\) denote the power spectra of the original image and of the additive noise, respectively, and \(H(f_1, f_2)\) is the blurring filter. The Wiener filter performs deconvolution by minimizing the least squared error,

$$\begin{aligned} e^{2}=E\{(f-\widehat{f})^{2}\} \end{aligned}$$
(2)

Here E denotes the expected value, f is the un-degraded image and \(\widehat{f}\) is its estimate.
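As a rough illustration of this denoising step, a minimal sketch using SciPy’s adaptive Wiener filter (scipy.signal.wiener) as a practical stand-in for Eq. (1) is shown below; the file name and window size are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import wiener
from skimage import img_as_float, io

# "sample_document.png" is a placeholder file name, not from the paper.
image = img_as_float(io.imread("sample_document.png", as_gray=True))

# Adaptive Wiener filtering: the local mean and variance inside a 5x5
# window are used to attenuate additive noise while preserving edges.
denoised = wiener(image, mysize=(5, 5))

# Keep intensities in the valid range before further processing.
denoised = np.clip(denoised, 0.0, 1.0)
```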

3.2 Convert Input Color Image into Gray Scale Image

The color of a pixel is represented as a combination of chrominance and luminance: chrominance carries the color components of the input image and luminance the intensity. The intensity is calculated as a weighted mean of the red, green and blue (RGB) components. Since a color image is three-dimensional and therefore computationally expensive to process, the RGB images are converted into gray-scale images using the NTSC color format [2].
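A minimal sketch of this conversion, using the standard NTSC (Rec. 601) luma weights 0.299, 0.587 and 0.114, is shown below; the function name is ours.

```python
import numpy as np

def rgb_to_gray_ntsc(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image to gray scale with NTSC (Rec. 601) weights."""
    weights = np.array([0.299, 0.587, 0.114])
    # Weighted mean of the R, G and B channels gives the luminance.
    return rgb[..., :3] @ weights
```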

3.3 Intensity Value Normalization

Intensity normalization is very important for handling variable lighting. It is essentially a method that maps the intensity values into a prescribed range. Normalization transforms an n-dimensional gray-scale image \(Img = \{X \subseteq R^n\}\rightarrow \{Min,\ldots, Max\}\), whose intensity values lie between Min and Max, into a new image \(Img' = \{X \subseteq R^n\}\rightarrow \{new\_Min,\ldots, new\_Max\}\), whose intensity values lie between new_Min and new_Max. The normalization is carried out using the histogram, following the steps below.

(Algorithm a: intensity value normalization steps)
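As an illustration of this step, a minimal sketch of a min–max normalization consistent with the description above is given here; the target bounds are illustrative assumptions.

```python
import numpy as np

def normalize_intensity(img: np.ndarray,
                        new_min: float = 0.0,
                        new_max: float = 255.0) -> np.ndarray:
    """Linearly map intensities from [Min, Max] to [new_min, new_max]."""
    old_min, old_max = float(img.min()), float(img.max())
    if old_max == old_min:                      # flat image: nothing to stretch
        return np.full(img.shape, new_min, dtype=np.float64)
    scale = (new_max - new_min) / (old_max - old_min)
    return (img.astype(np.float64) - old_min) * scale + new_min
```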

3.4 Find the Most Informative Regions

Input images may contain the actual information along with distortions caused by oil, ink stains and similar artefacts. It is therefore important to segment the truly informative regions. To find the most informative regions, texture-based information is used; among the many texture properties, entropy serves this purpose. Entropy [15, 16] measures the disorder, uncertainty or randomness of a given dataset, and this randomness is higher in text areas than in non-text areas. The following algorithm is used to find the most informative regions.

(Algorithms b and c: locating the most informative regions via local entropy)
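As an illustration of this step, a minimal sketch of a windowed Shannon-entropy map is given below; the window size, the number of histogram bins and the 8-bit intensity range are assumptions, not values taken from the paper.

```python
import numpy as np

def local_entropy(gray: np.ndarray, win: int = 9, bins: int = 32) -> np.ndarray:
    """Shannon entropy of the intensity histogram inside a win x win window."""
    h, w = gray.shape
    pad = win // 2
    padded = np.pad(gray, pad, mode="reflect")
    ent = np.zeros((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + win, j:j + win]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 255.0))
            p = hist[hist > 0] / hist.sum()
            ent[i, j] = -np.sum(p * np.log2(p))
    return ent

# High-entropy pixels are treated as informative (text or graphics),
# low-entropy pixels as smooth background.
```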

3.5 Segment Individual Non-overlapped Regions

The next step is to segment the individual non-overlapping regions. A clustering method [17, 18] is applied to find the different cluster seed points, as shown in Fig. 2a; here three cluster centres are found. The target is then to segment each individual cluster region, as shown in Fig. 2b, c and d.
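Since fuzzy c-means is the clustering technique named earlier, a minimal self-contained sketch of fuzzy c-means on the pixel intensities is given below; the parameter values (three clusters, fuzzifier m = 2, tolerance) are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(values: np.ndarray, c: int = 3, m: float = 2.0,
                  max_iter: int = 100, tol: float = 1e-5, seed: int = 0):
    """Fuzzy c-means on a 1-D array of pixel intensities.

    Returns (centres, memberships); memberships has shape (len(values), c).
    """
    rng = np.random.default_rng(seed)
    x = values.astype(np.float64).ravel()
    u = rng.random((x.size, c))
    u /= u.sum(axis=1, keepdims=True)           # membership rows sum to one
    for _ in range(max_iter):
        um = u ** m
        centres = (um.T @ x) / um.sum(axis=0)   # weighted cluster centres
        dist = np.abs(x[:, None] - centres[None, :]) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        if np.max(np.abs(new_u - u)) < tol:
            u = new_u
            break
        u = new_u
    return centres, u

# Example usage:
# centres, u = fuzzy_c_means(gray.ravel(), c=3)
# labels = np.argmax(u, axis=1).reshape(gray.shape)
```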

$$\begin{aligned} Single\_Region\_Set = \{P : P \subseteq Region\_Point\_Set\} \end{aligned}$$
Fig. 2. (a) After applying the clustering technique. (b) First most informative region, cluster centre 74.1604. (c) Second most informative region, cluster centre 167.7177. (d) Third most informative region, cluster centre 225.7846

Each element of \(Single\_Region\_Set\) is a set P whose members are vectors containing the row and column numbers of a pixel; P thus contains the coordinates of all pixels of a single region. No two elements of \(Single\_Region\_Set\) share a common pixel.

\(P_1 \cup P_2 \cup \cdots \cup P_n = Region\_Point\_Set\), where \(P_1, P_2, \ldots, P_n\) are all the elements of \(Single\_Region\_Set\) and n is the total number of elements in \(Single\_Region\_Set\). Furthermore, \(P_i \cap P_j = \phi\) for any two elements \(P_i, P_j\) of \(Single\_Region\_Set\) with \(i, j \in \{1, 2, \ldots, n\}\) and \(i \ne j\). Each element of \(Single\_Region\_Set\) contains the coordinates of a single contour; these sets are obtained by applying the procedure segment each region.

(Algorithm d: segment each region)
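One plausible realization of this region-extraction step, treating each connected component of a cluster’s label mask as a single non-overlapping region, is sketched below; it is a stand-in under that assumption, not the exact procedure of the listing.

```python
import numpy as np
from scipy import ndimage

def extract_regions(labels: np.ndarray):
    """Build Single_Region_Set: a list of (row, col) coordinate arrays,
    one per connected component of each cluster label."""
    single_region_set = []
    for cluster_id in np.unique(labels):
        mask = labels == cluster_id
        comp, n_comp = ndimage.label(mask)       # connected components of this cluster
        for k in range(1, n_comp + 1):
            coords = np.argwhere(comp == k)      # pixel coordinates of one region
            single_region_set.append(coords)
    return single_region_set
```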

After applying the segment-each-region algorithm, the desired regions are segmented. A global binarization technique is then applied to each segmented region, with the threshold value for binarization calculated as follows [19].

(Algorithm e: threshold calculation for binarization)
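The sketch below shows how each segmented region can be binarized with its own threshold and the results integrated into one output image; Otsu’s rule from scikit-image is used purely as a hypothetical stand-in for the threshold calculation of [19].

```python
import numpy as np
from skimage.filters import threshold_otsu

def binarize_regions(gray: np.ndarray, single_region_set) -> np.ndarray:
    """Threshold each region independently and merge into one binary image."""
    binary = np.ones_like(gray, dtype=np.uint8)   # start as background (white)
    for coords in single_region_set:
        rows, cols = coords[:, 0], coords[:, 1]
        values = gray[rows, cols]
        if values.min() == values.max():          # flat region: nothing to threshold
            continue
        t = threshold_otsu(values)                # stand-in for the threshold of [19]
        binary[rows, cols] = (values > t).astype(np.uint8)
    return binary
```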
Fig. 3. Comparative study

Fig. 4. Average performance graph of the seven metrics for the compared methods

4 Performance Analysis

For the performance analysis, degraded document images collected from multiple sites are used, and the method is additionally tested on arbitrarily chosen images from the USC-SIPI database for experimental verification. The results are shown in Fig. 3.

In this work, seven different metrics are compared across Otsu’s method, iterative partitioning and the proposed method: recall, precision, F-measure, pixel error rate (PERR), mean squared error (MSE), signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR) [12, 20]. Conceptually, recall is the probability that a relevant document is retrieved during a search, and precision is the probability that a retrieved document is relevant. Both Otsu’s method and iterative partitioning lose some of the text region, whereas our method loses none of it; only the black shirt in the graphical part appears as white, because the entropy of that region does not change. Iterative partitioning performs poorly in both cases because it partitions the image into four sub-blocks based on the histogram. Table 1 shows the average performance of Otsu’s method, iterative partitioning and the proposed method.
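As an illustration, a minimal sketch of how pixel-level recall, precision, F-measure, MSE and PSNR can be computed against a ground-truth binary image is shown below; treating text pixels as the positive class and assuming 8-bit images are our assumptions, not choices stated in the paper.

```python
import numpy as np

def binarization_metrics(pred: np.ndarray, truth: np.ndarray, peak: float = 255.0):
    """Pixel-level metrics; pred and truth are binary masks with True/1 = text."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    # MSE and PSNR computed on the equivalent 8-bit (0 / peak) images.
    mse = np.mean(pred != truth) * peak ** 2
    psnr = np.inf if mse == 0 else 10 * np.log10(peak ** 2 / mse)
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "mse": mse, "psnr": psnr}
```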

The performance analysis table shows that the probability of retrieving a relevant document (recall) is 99.5 % and the precision is 96.45 %, which are quite good. The misclassification rate is also lower than that of Otsu’s method and iterative partitioning. Our method performs much better than the other two methods in the presence of Gaussian and random noise, which makes it evident that the information-density-based binarization technique is more noise-resistant than the other two binarization techniques.

Table 1. Performance analysis

5 Conclusion

In this research work, we have proposed a new binarization method based on clustering of a gray-scale image. Random noise, salt-and-pepper noise and Gaussian noise are removed, and the proposed method has been tested on a benchmark image database. The method cleanly separates the compound document and produces better results than Otsu’s method and iterative partitioning, and it also scores well on the evaluation metrics discussed above. However, experimental observation reveals a limitation of the proposed method in binarizing X-ray-type medical images; the work may be extended to address this aspect.