
1 Introduction

In image analysis, a histogram is a graphical representation of the pixel distribution of an image, describing the number or frequency of pixels at each image intensity value.

When object classification is performed using histograms, the underlying model takes into account only the color of the object and ignores its shape and texture. It is also important to mention that a histogram contains no spatial information about its corresponding image: the image cannot be recovered from the histogram, and two different images can have the same histogram. Histograms can therefore be identical for two different images that contain different objects but share the same color information. In other words, in the absence of spatial or shape information, different objects of similar colors may be identified as identical when only their color histograms are compared. Despite these difficulties, techniques such as color histogram intersection, color constant indexing, cumulative color histograms, and color distances are used to compare images. While there are drawbacks to using histograms for image indexing and classification, color also has several advantages in these tasks: color information is fast to compute compared to other features, and color-based methods have been shown to be effective in identifying objects of known location and appearance.

There are studies that relate color histogram data to the physical properties of the objects in an image [1]. These studies have shown that physical properties may capture not only the luminance and color of an object but also image geometry and roughness, which together provide a better estimate of object luminance and color. Different solutions to the issues associated with comparing color histograms have been proposed in the literature, for example, distance measures. Among the most widely used distance measures for calculating the degree of similarity between images are the Euclidean distance, histogram intersection, and the quadratic distance; correlation coefficients are also applied. Many papers discuss distance measures; we found, specifically, two studies in the context of histogram comparison for image analysis tasks. The first is a comparative study of histogram distances for object identification by Marín [2], whereas the second is entitled “On measuring the distance between histograms” by Cha et al. [3]. These two papers served as a basis for our study, whose aim is to compare distance measures for calculating the similarity between histograms in image analysis tasks. We also improve the accuracy of some distance measures in which indeterminations were found.

In summary, this paper presents a comparative analysis of some of the most popular techniques for measuring distances between histograms. Modifications to the distances that present indeterminations are also proposed. The modified algorithms can be employed in tasks such as pattern recognition, feature selection, image classification, clustering, identification, indexing, and retrieval.

2 Distances

2.1 Distances Between Histograms

Generally, a distance is a numerical measure of how far apart two points are. The distance between two histograms A and B can be defined as a mathematical function that satisfies the following conditions:

  (a) Non-negativity: \( d\left( {A,B} \right) \ge 0 \), where \( d\left( {A,B} \right) = 0 \leftrightarrow A = B \);

  (b) Symmetry: \( d\left( {A,B} \right) = d\left( {B,A} \right) \);

  (c) Triangle inequality: \( d\left( {A,C} \right) \le d\left( {A,B} \right) + d\left( {B,C} \right) \).

Two types of measures are used to calculate distances between histograms. The first is called bin-to-bin; it compares the corresponding bins of the two histograms one by one (i.e., the first bin of one histogram with the first bin of the other, and so on). The second type is called cross-bin; it also takes into account bins adjacent to the one being compared. We used bin-to-bin measures, where each histogram bin is treated independently and distances can be calculated from sums and averages.

The definitions of the six different distance measures employed in this study are introduced below.

Bhattacharyya.

The Bhattacharyya distance is used to assess the similarity of two distributions; the smaller the value, the closer the distributions. The equation for the distance is given by [4] as follows:

$$ d\left( {H_{1} ,H_{2} } \right) = - { \ln }\left( {BC\left( {H_{1} ,H_{2} } \right)} \right) $$
(1)
$$ BC\left( {H_{1} ,H_{2} } \right) = \sum\nolimits_{I = 0}^{N - 1} {\sqrt {H_{1} \left( I \right) *H_{2} \left( I \right)} } $$
(2)

where \( BC\left( {H_{1} ,H_{2} } \right) \) is the Bhattacharyya coefficient for discrete probability distributions, N is the number of bins (usually 256), \( H_{1} \) and \( H_{2} \) are the first and second histograms, respectively, and \( \overline{{H_{1} }} \) and \( \overline{{H_{2} }} \) denote their means (used in Eq. (5) below), calculated as

$$ \overline{{H_{k} }} = \frac{1}{N}\sum\nolimits_{J = 0}^{N - 1} {H_{k} \left( J \right) } $$
(3)
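As an illustration (this is a minimal sketch, not the authors' ImageJ plugin), Eqs. (1)–(2) can be transcribed directly into NumPy. The sketch assumes the histograms are normalized to unit sum so that the coefficient \( BC \) lies in [0, 1] and the distance is non-negative; the function name `bhattacharyya` is hypothetical.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance, Eqs. (1)-(2).

    The histograms are normalized to unit sum (an assumption of this
    sketch) so that the coefficient BC lies in [0, 1].
    """
    h1 = np.asarray(h1, dtype=float) / np.sum(h1)
    h2 = np.asarray(h2, dtype=float) / np.sum(h2)
    bc = np.sum(np.sqrt(h1 * h2))  # Bhattacharyya coefficient, Eq. (2)
    return -np.log(bc)             # Eq. (1); infinite if the histograms do not overlap
```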

Chi-square.

Chi-square distance is a statistical measure that compares observed and expected values for a data set. It is defined by the following expression [5]:

$$ d\left( {H_{1} , H_{2} } \right) = \sum\nolimits_{I = 0}^{N - 1} {\frac{{\left( {H_{1} \left( I \right) - H_{2} \left( I \right)} \right)^{2} }}{{H_{1} \left( I \right)}}} $$
(4)
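A direct transcription of Eq. (4) (a sketch under the same conventions; `chi_square` is a hypothetical name) makes the indetermination discussed in Sect. 2.2 visible: any empty bin in \( H_{1} \) produces a division by zero.

```python
import numpy as np

def chi_square(h1, h2):
    """Chi-square distance, Eq. (4). Note the asymmetry in h1 and h2."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    # Division by zero where h1[i] == 0: the indetermination of Sect. 2.2.
    return np.sum((h1 - h2) ** 2 / h1)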

Correlation.

Correlation describes the degree of linear dependence between two histograms. Its value varies between −1 and +1; a result of zero means that there is no linear association between the two histograms being compared. It is calculated as follows [5]:

$$ d\left( {H_{1} , H_{2} } \right) = \frac{{\mathop \sum \nolimits_{I = 0}^{N - 1} \left( {H_{1} \left( I \right) - \overline{{H_{1} }} } \right)\left( {H_{2} \left( I \right) - \overline{{H_{2} }} } \right)}}{{\sqrt {\mathop \sum \nolimits_{I = 0}^{N - 1} \left( {H_{1} \left( I \right) - \overline{{H_{1} }} } \right)^{2} * \mathop \sum \nolimits_{I = 0}^{N - 1} \left( {H_{2} \left( I \right) - \overline{{H_{2} }} } \right)^{2} } }} $$
(5)
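A sketch of Eq. (5) under the same conventions (the name `correlation` is hypothetical):

```python
import numpy as np

def correlation(h1, h2):
    """Pearson correlation between two histograms, Eq. (5); in [-1, +1]."""
    d1 = np.asarray(h1, dtype=float) - np.mean(h1)
    d2 = np.asarray(h2, dtype=float) - np.mean(h2)
    return np.sum(d1 * d2) / np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2))
```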

Intersection.

The intersection metric considers the overlap of two histograms; it tells how much of the content of the first histogram is present in the second one. The equation is provided below [5]:

$$ d\left( {H_{1} , H_{2} } \right) = \sum\nolimits_{I = 0}^{N - 1} {\hbox{min} \left( {H_{1} \left( I \right), H_{2} \left( I \right)} \right)} $$
(6)
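Eq. (6) reduces to a bin-wise minimum (sketch; `intersection` is a hypothetical name). Unlike the other measures, larger values mean greater similarity:

```python
import numpy as np

def intersection(h1, h2):
    """Histogram intersection, Eq. (6): larger values mean more overlap."""
    return np.sum(np.minimum(h1, h2))
```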

Kullback-Leibler (KL).

The Kullback-Leibler pseudo-distance is an asymmetric measure that does not meet condition (b) of the distance definition introduced earlier. It originates from information theory, where it is known as relative entropy. It measures the average number of bits required to identify an event from a set of possibilities and numerically indicates how closely two histograms resemble each other. It is defined by the following equation [6]:

$$ d\left( {H_{1} , H_{2} } \right) = \sum\nolimits_{I = 0}^{N - 1} {H_{1} \left( I \right)\, log\frac{{H_{1} \left( I \right)}}{{H_{2} \left( I \right)}}} $$
(7)
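A direct transcription of Eq. (7) (sketch; `kullback_leibler` is a hypothetical name) again exposes the indetermination treated in Sect. 2.2, since empty bins lead to log(0) or division by zero:

```python
import numpy as np

def kullback_leibler(h1, h2):
    """KL pseudo-distance, Eq. (7); asymmetric in h1 and h2."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    # Undefined where h2[i] == 0 (and 0 * log 0 where h1[i] == 0).
    return np.sum(h1 * np.log(h1 / h2))
```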

Euclidean.

The Euclidean distance is frequently used for evaluating distances in numerical spaces. It determines the bin-to-bin distance between two histograms and is calculated according to the following equation [3]:

$$ d \left( {H_{1} , H_{2} } \right) = \sqrt {\sum\nolimits_{I = 0}^{N - 1} {(H_{1} \left( I \right) - H_{2} \left( I \right))^{2} } } $$
(8)
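Eq. (8) in the same sketch form (the name `euclidean` is hypothetical):

```python
import numpy as np

def euclidean(h1, h2):
    """Euclidean (L2) distance between two histograms, Eq. (8)."""
    diff = np.asarray(h1, dtype=float) - np.asarray(h2, dtype=float)
    return np.sqrt(np.sum(diff ** 2))
```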

2.2 Indetermination

As can be seen from Eqs. (4) and (7), the chi-square and KL distances can be undefined. In his work, Marín [2] simply discards the bins that are zero to avoid this indetermination. However, this solution is inappropriate: according to it, the two histograms shown in Fig. 1(a) and (b) would be reported as equal [2], since discarding the zero bins makes Dchi-square(A, B) and DKL(A, B) equal to zero. However, according to the distance definition, the distance between two histograms A and B is zero only when A = B. Therefore, in order to resolve the indetermination, we considered the following solutions:

Fig. 1. Example of a critical case when measuring the distance between two histograms (a) and (b).

Chi-square.

The indetermination can be resolved by using Eq. (9) instead of Eq. (4):

$$ \sum\nolimits_{I = 0}^{N - 1} {\frac{{\left[ {H_{1} \left( I \right) - H_{2} \left( I \right)} \right]^{2} }}{{H_{1} \left( I \right) + 1}}} $$
(9)

This solution is equivalent to adding one count to each bin of both histograms, given that

$$ \sum\nolimits_{I = 0}^{N - 1} {\frac{{\left[ {\left( {H_{1} \left( I \right) + 1} \right) - \left( {H_{2} \left( I \right) + 1} \right)} \right]^{2} }}{{H_{1} \left( I \right) + 1}}} = \sum\nolimits_{I = 0}^{N - 1} {\frac{{\left[ {H_{1} \left( I \right) - H_{2} \left( I \right)} \right]^{2} }}{{H_{1} \left( I \right) + 1}}} $$
(10)

This solution merely produces a small reduction in each of the addends. Note that a very small value \( \varepsilon \) must not be added instead, because a near-zero denominator would produce a very large addend and introduce an error into the distance computation. Figures 2(a, b) provide a graphical representation of the original and the proposed formulas, showing how close they are.
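In code, the fix of Eq. (9) is a one-term change to the denominator (sketch; `chi_square_mod` is a hypothetical name):

```python
import numpy as np

def chi_square_mod(h1, h2):
    """Modified chi-square, Eq. (9): the +1 in the denominator removes
    the indetermination at empty bins without discarding them."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return np.sum((h1 - h2) ** 2 / (h1 + 1.0))
```

With this version, the histograms of Fig. 1(a) and (b) yield a nonzero distance, as the distance definition requires.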

Fig. 2. Graphical representation of the chi-square and KL functions: (a, c) original equations, (b, d) modified equations.

KL.

Contrary to chi-square, the indetermination in the KL distance can be avoided by using the following expression:

$$ \sum\nolimits_{I = 0}^{N - 1} {H_{1} \left( I \right)*Log\left( {\frac{{H_{1} \left( I \right) + \varepsilon }}{{H_{2} \left( I \right) + \varepsilon }}} \right)} $$
(11)

where \( \varepsilon \) is a small quantity (epsilon); we took \( \varepsilon = 0.0001 \) for our calculations. We do not take \( \varepsilon = 1 \) because when \( H_{1} \left( I \right) \) and \( H_{2} \left( I \right) \) are very small, \( Log\left( {\frac{{H_{1} }}{{H_{2} }}} \right) \) is very different from \( Log\left( {\frac{{H_{1} + 1}}{{H_{2} + 1}}} \right) \). However, as seen in Figs. 2(c, d), \( Log\left( {\frac{{H_{1} }}{{H_{2} }}} \right) \cong Log\left( {\frac{{H_{1} + \varepsilon }}{{H_{2} + \varepsilon }}} \right) \) when \( \varepsilon \) is sufficiently small.
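Likewise for Eq. (11), with the value of \( \varepsilon \) used in the paper (sketch; `kl_mod` is a hypothetical name):

```python
import numpy as np

def kl_mod(h1, h2, eps=1e-4):
    """Modified KL, Eq. (11): a small epsilon in both numerator and
    denominator avoids log(0) and division by zero."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return np.sum(h1 * np.log((h1 + eps) / (h2 + eps)))
```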

3 Proposal

To test the introduced distances, we considered: five synthetic images designed by one of the team members at our university (see Fig. 3); four synthetic histograms (see Fig. 4); two microscopy images with backgrounds taken under different illumination conditions (see Fig. 5); four microscopy images of a rat brain (acquired from the Instituto de Neurociencias de Castilla y León, Salamanca, Spain; see Fig. 6); and two images of the same objects at different magnifications (see Fig. 7). Distances were calculated for the following cases: between the image in Fig. 3(a) and each of its modified variations in Figs. 3(b–e); between histograms (a), (b), (c), and (d) in Fig. 4; and between the images in Figs. 5, 6, and 7. The distances were implemented as plugins for the open-source program ImageJ [8].

Fig. 3. Synthetic images: (a) original, (b) inverted, (c) high gloss, (d) low gloss, (e) high contrast, (f) histogram of (a).

Fig. 4. Synthetic histograms: (a) original, (b) inverted, (c) only odd values – “odd hist”, (d) only even values – “even hist”.

Fig. 5. Microscopy images: (a) intensity 1, (b) intensity 2.

Fig. 6. Images of a rat brain: (a) normal intensity (CRnorm), (b) low light intensity (CRlow), (c) high light intensity (CRhigh), (d) contrast adjustment (CRcontr). (Color figure online)

Fig. 7. Images of the same objects: (a) Dist1, (b) Dist2.

4 Results

Results are shown in Tables 1, 2, 3, 4 and 5. Column 1 of each table contains the distance between a histogram and itself. It can be seen that every such distance is zero except for the correlation and the intersection. This is because these two measures are not true distances according to the definition provided in Sect. 2.1: the correlation can even take negative values, and the intersection can be zero if there are no common bins between two histograms, while it is maximal when they are equal. In Tables 1, 2, 3, 4 and 5, an indicator in quotes beside each distance name gives the value corresponding to the best equality approximation between histograms; for instance, the best result for the intersection is the largest value, for the correlation it is a value equal to one, and for chi-square, Bhattacharyya, KL, and Euclidean it is zero. Chi-square distances are more similar between the original and the inverted images than between the original and the high-gloss images, which are quite different, since the synthetic image has very few gray levels.

Table 1. Distances between images in Fig. 3.
Table 2. Distances between synthetic histograms in Fig. 4.
Table 3. Distances between images in Fig. 5.
Table 4. Results of distance measures between the respective images in Fig. 6.
Table 5. Distances between images in Fig. 7.

Table 1 shows the distances between synthetic images with a small number of gray levels. It can be seen from Table 1 that the distances indicate that the original and inverted images are quite dissimilar and there is no correlation between them. Results are appropriate only for the original vs. low-gloss and original vs. high-contrast comparisons.

The results obtained for the synthetic histograms in Fig. 4 are shown in Table 2.

As seen in Table 2, the correlation gives a good indication that the histogram in Fig. 4(b) is the inverted version of the histogram in Fig. 4(a). When comparing “odd hist” with “even hist”, most distances indicate that the histograms are very similar, whereas the intersection is 0, given that there are no common bins between them. The correlation likewise does not indicate that these are similar histograms.

Column 2 in Table 3 shows that the images in Fig. 5 are quite different.

The color images in Fig. 6 were compared in the RGB color space by measuring the distance between the histograms of each channel and averaging the results, as suggested by Prashant [7]. The distances calculated between the images in Fig. 6 are shown in Table 4. As seen in Table 4, variation in object intensity can make the distances indicate that the histograms are quite different. This suggests that image intensities must be similar for histogram comparison to be meaningful. At the same time, the Bhattacharyya distance is the most robust to intensity variations.
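The channel-averaging scheme described above can be sketched as follows, assuming the images are NumPy arrays of shape (height, width, 3); the names `rgb_distance` and `dist` are hypothetical, and any of the distance functions sketched in Sect. 2 could be passed in:

```python
import numpy as np

def rgb_distance(img1, img2, dist, bins=256):
    """Average a histogram distance over the R, G, B channels of two
    images, as suggested by Prashant [7]."""
    values = []
    for c in range(3):  # one histogram per color channel
        h1, _ = np.histogram(img1[..., c], bins=bins, range=(0, 256))
        h2, _ = np.histogram(img2[..., c], bins=bins, range=(0, 256))
        values.append(dist(h1, h2))
    return np.mean(values)
```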

Columns 2 and 3 in Table 4 show that brightness variation increases the distances. This occurs for all distances except the Bhattacharyya distance in columns 3 and 4, where the most similar values are obtained. To verify this observation, we compared RGB color photographs showing objects with similar light intensities at different distances.

The results in Table 5 show that histogram distances are not good indicators of how close two images are when their lighting is quite different. The Bhattacharyya distance shows the best performance even when the intensities are dissimilar. When the lighting is similar, the correlation and the Bhattacharyya distance show the best results.

The last experiment also illustrates another problem in calculating distances: histograms contain no spatial information about their images, so two different images may coincide in their histogram representations.

5 Conclusions

In this work, the performance of six distance measures for assessing the similarity between histograms in image analysis tasks was compared. Some of the considered measures are not true distances, namely the KL pseudo-distance and the correlation.

The chi-square and KL distances were modified to avoid indeterminations when bins are equal to zero. We showed that the proposed solution is more effective than the originally introduced one: it prevents two different histograms from appearing equal when an indetermination occurs.

It was found that the considered distance measures show poor results when histograms are not continuous or when images of the same objects have a high intensity variation; only the Bhattacharyya distance showed that two images of the same objects were close when their intensities were very different. When the lighting was similar, the Bhattacharyya distance and the correlation performed best. It was also found that, while the correlation is not a true distance, it can be useful for comparing histograms and showing how two histograms are related.

In the future, we want to analyze a greater number of distance measures, such as the EMD (Earth Mover's Distance), and test their performance on a larger number of images using histograms from more color spaces.