1 Introduction

Preserving documents and their legacy is an important task, and document image processing plays a vital role in it. Document image binarization is commonly performed as a preprocessing step in optical character recognition (OCR) [1, 2] and image search, and it supports handwriting recognition as well as the extraction of logos and pictures from graphical images. The main purposes of document image processing [3] are to reduce paper usage and to provide easy access to documents at the lowest possible storage cost. The most challenging task at this stage is to segment the region of interest (ROI) for further analysis. The simplest method for image segmentation [4] is thresholding-based binarization, which is also an essential technique in image enhancement and biomedical image analysis. The output of this process is a binary image [5]. Although researchers have worked on document image binarization for several years, thresholding of compound document images remains challenging because of its sensitivity to noise, illumination, variable intensity and, in some cases, insufficient contrast. It has been observed that several existing methods [6–10] offer very good results for text documents. However, their performance degrades when a degraded text document contains graphical images. We refer to this type of document as a compound document in the rest of this paper.

In this work, we aim to devise a new segmentation methodology that performs well on compound documents. We separate the entire image into three regions: the background, the text-only region and the graphical image. The proposed method keeps a good balance between text and graphics in degraded compound documents. It is based on cluster density information and consists of six phases: noise removal with image normalization, entropy calculation, fuzzy c-means clustering, segmentation of each region based on the clustering output, local-threshold-based binarization and, finally, integration of the segmented regions. Each of these phases is described in detail in the design methodology section.

2 Survey on Existing Techniques

Document image binarization has drawn a lot of attention in the machine vision research community. Some of the most highly cited methods are summarized in this section. Parker et al. proposed a method based on the Shen–Castan edge detector to identify object pixels [7]; it builds a surface using a moving least squares fit, which is then used to threshold the image. Chen et al. proposed a fast entropic threshold selection algorithm [8] that selects a global threshold value using a maximin optimization procedure. O’Gorman proposed a global approach based on measuring local connectivity information, with threshold values incorporated at the intensity level; the method therefore combines the advantages of local and global adaptive approaches. Liu et al. proposed a method based on grey-scale and run-length histograms that carefully handles noisy and complex backgrounds. Chang et al. worked on stroke connectivity preservation for graphical images [11]; their algorithm eliminates background noise, enhances the grey levels of text, and can extract strokes from low-density as well as dark backgrounds. Shaikh et al. [12] proposed an iterative partitioning method in which the input image is divided into four equal sub-images whenever its histogram contains more than two peaks; the process continues until each sub-image contains no more than two histogram peaks. This binarization method gives good results for very old, faded and stained document images but fails for medical image segmentation. Otsu’s method [13] thresholds the image based on the class variance criterion computed from the histogram of the input image; it segments the image into two classes such that the variance between the classes is maximized. Otsu’s binarization technique produces good results for graphical images; however, it cannot properly binarize old, spotted documents.

3 Design of New Information Density Based Binarization Technique

3.1 Image Acquisition and Enhancement

The design methodology follows a pipelined approach starting with image acquisition and ending with binarization. Histogram-based Otsu’s method may provide satisfactory results when the document images are clean. In practice, however, old and repeatedly photocopied documents are of poor quality, and such documents are often not binarized properly by Otsu’s method.

Otsu’s method is a generalized, histogram-based binarization; it performs no information-density-based segmentation. The algorithm assumes that the input image contains foreground and background pixels and then calculates the optimal threshold that separates the two regions. The USC-SIPI database [14] is used for the testing phase; the quality and size of the original images are not changed. In addition to this database, we have also tested the proposed method on sample scanned document images that contain both text and graphical images (Fig. 1).

Fig. 1. Block diagram of the proposed method

Image enhancement improves the visual quality of the input image for the subsequent processing steps. The quality of some of the input images is poor: most of them contain speckle noise, salt-and-pepper noise and some random noise. A Wiener filter is used here to remove this mixture of noise. The Wiener filter is a linear filter that removes additive noise and deblurs the image, and it performs well in reducing the mean squared error. It is usually applied in the frequency domain, where the output image \(W(f_1, f_2)\) is given by

$$\begin{aligned} W(f_1, f_2)=\frac{H^{*}(f_1, f_2)\, S_x(f_1, f_2)}{|H(f_1, f_2)|^{2}\, S_x(f_1, f_2)+S_y(f_1, f_2)} \end{aligned}$$
(1)

Here, \(S_x(f_1, f_2)\) and \(S_y(f_1, f_2)\) denote the power spectra of the original image and of the additive noise, respectively, and \(H(f_1, f_2)\) is the blurring filter. The Wiener filter performs deconvolution by minimizing the least squared error,

$$\begin{aligned} e^{2}=E\{(f-\widehat{f})^{2}\} \end{aligned}$$
(2)

Here E denotes the expected value, f is the un-degraded image and \(\widehat{f}\) is its estimate.
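As a rough illustration of this denoising step, a minimal sketch using SciPy’s adaptive Wiener filter (scipy.signal.wiener) as a practical stand-in for Eq. (1) is shown below; the file name and window size are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import wiener
from skimage import img_as_float, io

# "sample_document.png" is a placeholder file name, not from the paper.
image = img_as_float(io.imread("sample_document.png", as_gray=True))

# Adaptive Wiener filtering: the local mean and variance inside a 5x5
# window are used to attenuate additive noise while preserving edges.
denoised = wiener(image, mysize=(5, 5))

# Keep intensities in the valid range before further processing.
denoised = np.clip(denoised, 0.0, 1.0)
```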

3.2 Convert Input Color Image into Gray Scale Image

The color of a pixel is represented as a combination of chrominance and luminance: chrominance carries the color components of the input image and luminance the intensity. The intensity is calculated as a weighted mean of the red, green and blue (RGB) components. Since a color image is three-dimensional and therefore computationally expensive to process, the RGB images are converted into gray-scale images using the NTSC color format [2].
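A minimal sketch of this conversion, using the standard NTSC (Rec. 601) luma weights 0.299, 0.587 and 0.114, is shown below; the function name is ours.

```python
import numpy as np

def rgb_to_gray_ntsc(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image to gray scale with NTSC (Rec. 601) weights."""
    weights = np.array([0.299, 0.587, 0.114])
    # Weighted mean of the R, G and B channels gives the luminance.
    return rgb[..., :3] @ weights
```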

3.3 Intensity Value Normalization

Intensity normalization is very important for handling variable lighting. It is essentially a method that maps the intensity values into a prescribed range. Normalization transforms an n-dimensional gray-scale image \(Img = \{X \subseteq R^n\}\rightarrow \{Min,\ldots, Max\}\), whose intensity values lie between Min and Max, into a new image \(Img' = \{X \subseteq R^n\}\rightarrow \{new\_Min,\ldots, new\_Max\}\), whose intensity values lie between new_Min and new_Max. The normalization is carried out using the histogram, following the steps below.

(Algorithm a: intensity value normalization steps)
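As an illustration of this step, a minimal sketch of a min–max normalization consistent with the description above is given here; the target bounds are illustrative assumptions.

```python
import numpy as np

def normalize_intensity(img: np.ndarray,
                        new_min: float = 0.0,
                        new_max: float = 255.0) -> np.ndarray:
    """Linearly map intensities from [Min, Max] to [new_min, new_max]."""
    old_min, old_max = float(img.min()), float(img.max())
    if old_max == old_min:                      # flat image: nothing to stretch
        return np.full(img.shape, new_min, dtype=np.float64)
    scale = (new_max - new_min) / (old_max - old_min)
    return (img.astype(np.float64) - old_min) * scale + new_min
```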

3.4 Find the Most Informative Regions

Input images may contain the actual information along with distortions caused by oil, ink stains and similar artefacts. It is therefore important to segment the truly informative regions. To find the most informative regions, texture-based information is used; among the many texture properties, entropy serves this purpose. Entropy [15, 16] measures the disorder, uncertainty or randomness of a given dataset, and this randomness is higher in text areas than in non-text areas. The following algorithm is used to find the most informative regions.

(Algorithms b and c: locating the most informative regions via local entropy)
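As an illustration of this step, a minimal sketch of a windowed Shannon-entropy map is given below; the window size, the number of histogram bins and the 8-bit intensity range are assumptions, not values taken from the paper.

```python
import numpy as np

def local_entropy(gray: np.ndarray, win: int = 9, bins: int = 32) -> np.ndarray:
    """Shannon entropy of the intensity histogram inside a win x win window."""
    h, w = gray.shape
    pad = win // 2
    padded = np.pad(gray, pad, mode="reflect")
    ent = np.zeros((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + win, j:j + win]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 255.0))
            p = hist[hist > 0] / hist.sum()
            ent[i, j] = -np.sum(p * np.log2(p))
    return ent

# High-entropy pixels are treated as informative (text or graphics),
# low-entropy pixels as smooth background.
```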

3.5 Segment Individual Non-overlapped Regions

The next step is to segment the individual non-overlapping regions. A clustering method [17, 18] is applied to find the different cluster seed points, as shown in Fig. 2a; here three cluster centres are found. The target is then to segment each individual cluster region, as shown in Fig. 2b, c and d.
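Since fuzzy c-means is the clustering technique named earlier, a minimal self-contained sketch of fuzzy c-means on the pixel intensities is given below; the parameter values (three clusters, fuzzifier m = 2, tolerance) are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(values: np.ndarray, c: int = 3, m: float = 2.0,
                  max_iter: int = 100, tol: float = 1e-5, seed: int = 0):
    """Fuzzy c-means on a 1-D array of pixel intensities.

    Returns (centres, memberships); memberships has shape (len(values), c).
    """
    rng = np.random.default_rng(seed)
    x = values.astype(np.float64).ravel()
    u = rng.random((x.size, c))
    u /= u.sum(axis=1, keepdims=True)           # membership rows sum to one
    for _ in range(max_iter):
        um = u ** m
        centres = (um.T @ x) / um.sum(axis=0)   # weighted cluster centres
        dist = np.abs(x[:, None] - centres[None, :]) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        if np.max(np.abs(new_u - u)) < tol:
            u = new_u
            break
        u = new_u
    return centres, u

# Example usage:
# centres, u = fuzzy_c_means(gray.ravel(), c=3)
# labels = np.argmax(u, axis=1).reshape(gray.shape)
```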

$$\begin{aligned} Single\_Region\_Set = \{P : P \subseteq Region\_Point\_Set\} \end{aligned}$$
Fig. 2. (a) After applying the clustering technique. (b) First most informative region, cluster centre 74.1604. (c) Second most informative region, cluster centre 167.7177. (d) Third most informative region, cluster centre 225.7846

Each element of \(Single\_Region\_Set\) is a set P whose members are vectors containing the row and column numbers of a pixel; P thus contains the coordinates of all pixels of a single region. No two elements of \(Single\_Region\_Set\) share a common pixel.

\(P_1 \cup P_2 \cup \cdots \cup P_n = Region\_Point\_Set\), where \(P_1, P_2, \ldots, P_n\) are all the elements of \(Single\_Region\_Set\) and n is the total number of elements in \(Single\_Region\_Set\). Furthermore, \(P_i \cap P_j = \phi\) for any two elements \(P_i, P_j\) of \(Single\_Region\_Set\) with \(i, j \in \{1, 2, \ldots, n\}\) and \(i \ne j\). Each element of \(Single\_Region\_Set\) contains the coordinates of a single contour; these sets are obtained by applying the procedure segment each region.

(Algorithm d: segment each region)
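One plausible realization of this region-extraction step, treating each connected component of a cluster’s label mask as a single non-overlapping region, is sketched below; it is a stand-in under that assumption, not the exact procedure of the listing.

```python
import numpy as np
from scipy import ndimage

def extract_regions(labels: np.ndarray):
    """Build Single_Region_Set: a list of (row, col) coordinate arrays,
    one per connected component of each cluster label."""
    single_region_set = []
    for cluster_id in np.unique(labels):
        mask = labels == cluster_id
        comp, n_comp = ndimage.label(mask)       # connected components of this cluster
        for k in range(1, n_comp + 1):
            coords = np.argwhere(comp == k)      # pixel coordinates of one region
            single_region_set.append(coords)
    return single_region_set
```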

After applying the segment-each-region algorithm, the desired regions are segmented. A global binarization technique is then applied to each segmented region, with the threshold value for binarization calculated as follows [19].

(Algorithm e: threshold calculation for binarization)
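The sketch below shows how each segmented region can be binarized with its own threshold and the results integrated into one output image; Otsu’s rule from scikit-image is used purely as a hypothetical stand-in for the threshold calculation of [19].

```python
import numpy as np
from skimage.filters import threshold_otsu

def binarize_regions(gray: np.ndarray, single_region_set) -> np.ndarray:
    """Threshold each region independently and merge into one binary image."""
    binary = np.ones_like(gray, dtype=np.uint8)   # start as background (white)
    for coords in single_region_set:
        rows, cols = coords[:, 0], coords[:, 1]
        values = gray[rows, cols]
        if values.min() == values.max():          # flat region: nothing to threshold
            continue
        t = threshold_otsu(values)                # stand-in for the threshold of [19]
        binary[rows, cols] = (values > t).astype(np.uint8)
    return binary
```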
Fig. 3. Comparative study

Fig. 4. Average performance graph of the seven metrics for the compared methods

4 Performance Analysis

For the performance analysis, degraded document images collected from multiple sites are used, and the method is additionally tested on arbitrarily chosen images from the USC-SIPI database for experimental verification. The results are shown in Fig. 3.

In this work, seven different metrics are compared across Otsu’s method, iterative partitioning and the proposed method: recall, precision, F-measure, pixel error rate (PERR), mean squared error (MSE), signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR) [12, 20]. Conceptually, recall is the probability that a relevant document is retrieved during a search, and precision is the probability that a retrieved document is relevant. Both Otsu’s method and iterative partitioning lose some of the text region, whereas our method loses none of it; only the black shirt in the graphical part appears as white, because the entropy of that region does not change. Iterative partitioning performs poorly in both cases because it partitions the image into four sub-blocks based on the histogram. Table 1 shows the average performance of Otsu’s method, iterative partitioning and the proposed method.
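As an illustration, a minimal sketch of how pixel-level recall, precision, F-measure, MSE and PSNR can be computed against a ground-truth binary image is shown below; treating text pixels as the positive class and assuming 8-bit images are our assumptions, not choices stated in the paper.

```python
import numpy as np

def binarization_metrics(pred: np.ndarray, truth: np.ndarray, peak: float = 255.0):
    """Pixel-level metrics; pred and truth are binary masks with True/1 = text."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    # MSE and PSNR computed on the equivalent 8-bit (0 / peak) images.
    mse = np.mean(pred != truth) * peak ** 2
    psnr = np.inf if mse == 0 else 10 * np.log10(peak ** 2 / mse)
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "mse": mse, "psnr": psnr}
```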

The performance analysis table shows that the probability of retrieving a relevant document (recall) is 99.5 % and the precision is 96.45 %, which are quite good. The misclassification rate is also lower than that of Otsu’s method and iterative partitioning. Our method performs much better than the other two methods in the presence of Gaussian and random noise, which makes it evident that the information-density-based binarization technique is more noise-resistant than the other two binarization techniques.

Table 1. Performance analysis

5 Conclusion

In this research work, we have proposed a new binarization method based on clustering of a gray-scale image. Random noise, salt-and-pepper noise and Gaussian noise are removed, and the proposed method has been tested on a benchmark image database. The method cleanly separates the compound document and produces better results than Otsu’s method and iterative partitioning, and it also scores well on the evaluation metrics discussed above. However, experimental observation reveals a limitation of the proposed method in binarizing X-ray-type medical images; the work may be extended to address this aspect.