1 Introduction

Classification is an important task in Pattern Recognition. The goal in supervised classification is to assign a new object to a category based on its features [1]. Applications in this area use training data in order to model the distribution of the features within each class. In this work we propose the use of bivariate copula functions to design a probabilistic model. Copula functions allow us to properly model dependencies among the object features, including nonlinear ones.

By using copula theory, a joint distribution can be built from a copula function and, possibly, several different marginal distributions. Copula theory has been used for modelling multivariate distributions in unsupervised learning problems [3, 5, 9, 13] as well as in supervised classification [4, 6, 7, 10, 12, 14, 15]. For instance, in [4], a challenging classification problem is solved by means of copula functions and vine graphical models; however, all marginal distributions are modelled with Gaussian distributions and the copula parameter is calculated by inverting Kendall’s tau. In [10, 15], simulated and real data are used to solve classification problems within the framework of copula theory, but no graphical models are employed and the marginal distributions are parametric. In this paper, we employ flexible marginal distributions based on Gaussian kernels, and the copula parameter is estimated by the maximum likelihood method. Moreover, the proposed classifier captures the most important dependencies by means of a graphical model. The reader interested in further applications of copula theory in supervised classification is referred to [6, 7, 12, 14].

The paper is organized as follows: Sect. 2 gives a short introduction to copula functions, Sect. 3 presents a copula-based probabilistic model for classification, Sect. 4 describes the experimental setting for classifying an image database, and Sect. 5 summarizes the results.

2 Copula Functions

Copula theory was introduced in [11] to separate the effect of dependence from the effect of the marginal distributions in a joint distribution. Although copula functions can model both linear and nonlinear dependencies, they have rarely been used in supervised classification, where nonlinear dependencies are common and need to be represented.

Definition 1

A copula function is a joint distribution function of standard uniform random variables. That is,

$$\begin{aligned} C(u_{1},\ldots ,u_{d}) = {Pr} [U_{1} \le u_{1}, \ldots , U_{d} \le u_{d}] , \end{aligned}$$

where \(U_{i} \sim U(0,1)\) for \(i=1,\ldots ,d.\)

By Sklar’s theorem, any d-dimensional density f can be represented as

$$\begin{aligned} f(x_{1},\ldots ,x_{d}) = c(F_{1}(x_{1}),\ldots ,F_{d}(x_{d})) \cdot \prod _{i=1}^{d} f_{i}(x_{i}) , \end{aligned}$$
(1)

where c is the density of the copula C, \(F_{i}(x_{i})\) is the marginal distribution function of random variable \(x_{i}\), and \(f_{i}(x_{i})\) is its marginal density. Equation (1) shows that the dependence structure is modelled by the copula function: it separates any joint density into the product of the copula density and the marginal densities. This contrasts with the usual approach to modelling multivariate distributions, which suffers from the restriction that all marginals must typically be of the same type. This separation between the marginal distributions and the dependence structure explains the modelling flexibility offered by copula functions.
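As a minimal numeric sketch of Eq. (1), the following snippet builds a bivariate joint density from a Gaussian copula and two deliberately different marginals (standard normal and exponential). The function names and parameter values are ours for illustration; the paper does not prescribe this particular pairing.

```python
import numpy as np
from scipy import stats

def gaussian_copula_density(u1, u2, rho):
    """Density of the bivariate Gaussian copula at (u1, u2)."""
    x, y = stats.norm.ppf(u1), stats.norm.ppf(u2)
    return np.exp((2 * rho * x * y - rho**2 * (x**2 + y**2))
                  / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

def joint_density(x1, x2, rho):
    """f(x1, x2) = c(F1(x1), F2(x2)) * f1(x1) * f2(x2), as in Eq. (1)."""
    f1, F1 = stats.norm.pdf(x1), stats.norm.cdf(x1)    # N(0, 1) marginal
    f2, F2 = stats.expon.pdf(x2), stats.expon.cdf(x2)  # Exp(1) marginal
    return gaussian_copula_density(F1, F2, rho) * f1 * f2

print(joint_density(0.3, 1.2, rho=0.6))
```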

Table 1. Bivariate copula densities.

In this paper we use two-dimensional parametric copula functions to model the dependence structure of random variables associated by a joint distribution function. The densities of these copula functions are shown in Table 1. We consider the Farlie-Gumbel-Morgenstern (FGM) copula, an elliptical copula (the Gaussian) and archimedean copulas (Independent, Ali-Mikhail-Haq (AMH), Clayton, Frank, Gumbel). These copula functions have been chosen because they cover a wide range of dependence structures. The AMH, Clayton, FGM, Frank and Gaussian copulas can model both negative and positive dependence between the marginals; the Gumbel copula is the exception, as it does not model negative dependence. The AMH and FGM copulas are adequate for marginals with modest dependence. When dependence between extreme values is strong, the Clayton and Gumbel copulas can model left and right tail association, respectively. The Frank copula is appropriate for data that exhibit weak dependence between extreme values and strong dependence between central values, while the Gaussian copula is adequate for the opposite case: weak dependence between central values and strong dependence between extreme values. When the Gaussian copula is used with standard Gaussian marginals, the joint probabilistic model is equivalent to a multivariate normal distribution.
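For reference, the following sketch implements three of the archimedean densities named above in their usual textbook parameterizations (Clayton: \(\theta > 0\), Frank: \(\theta \ne 0\), Gumbel: \(\theta \ge 1\)). Table 1 gives the authors’ exact forms, which may differ in notation.

```python
import numpy as np

def clayton_density(u, v, theta):
    # Strong lower-tail dependence; theta > 0.
    s = u**(-theta) + v**(-theta) - 1.0
    return (1 + theta) * (u * v)**(-(1 + theta)) * s**(-(2 * theta + 1) / theta)

def frank_density(u, v, theta):
    # Weak tail dependence, strong central dependence; theta != 0.
    num = theta * (1 - np.exp(-theta)) * np.exp(-theta * (u + v))
    den = ((np.exp(-theta) - 1)
           + (np.exp(-theta * u) - 1) * (np.exp(-theta * v) - 1))**2
    return num / den

def gumbel_density(u, v, theta):
    # Upper-tail dependence only; theta >= 1.
    x, y = -np.log(u), -np.log(v)
    w = x**theta + y**theta
    big_c = np.exp(-w**(1 / theta))  # the Gumbel copula C(u, v)
    return (big_c * (x * y)**(theta - 1) / (u * v)
            * w**(2 / theta - 2) * (1 + (theta - 1) * w**(-1 / theta)))
```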

The dependence parameter \(\theta \) of a bivariate copula function can be estimated by the maximum likelihood (ML) method. To do so, the one-parameter log-likelihood function

$$\begin{aligned} \ell \left( \theta ; \left\{ (u_{1,i},u_{2,i})\right\} _{i=1}^{n}\right) = \sum _{i=1}^{n} \text {log} \left( c(u_{1,i},u_{2,i};\theta ) \right) , \end{aligned}$$
(2)

is maximized. Assuming the marginal distributions are known, the pseudo copula observations \(\left\{ (u_{1,i},u_{2,i})\right\} _{i=1}^{n}\) in Eq. (2) are obtained by applying the marginal distribution functions of variables \(X_{1}\) and \(X_{2}\) to the data. The resulting maximum likelihood estimator of \(\theta \) is denoted \({\hat{\theta }}\). It has been shown in [16] that the ML estimator \({\hat{\theta }}\) has better statistical properties than alternative estimators.
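A hedged sketch of the ML step in Eq. (2): given pseudo copula observations, choose \(\theta \) to maximize the summed log copula density. The synthetic data, the Clayton family, the assumed-Gaussian marginals and the search bounds are all our choices, not the paper’s.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def clayton_density(u, v, theta):
    s = u**(-theta) + v**(-theta) - 1.0
    return (1 + theta) * (u * v)**(-(1 + theta)) * s**(-(2 * theta + 1) / theta)

# Synthetic dependent data standing in for two features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=500)

# Pseudo copula observations via (assumed) Gaussian marginals.
u1 = norm.cdf(x1, loc=x1.mean(), scale=x1.std(ddof=1))
u2 = norm.cdf(x2, loc=x2.mean(), scale=x2.std(ddof=1))

def neg_loglik(theta):
    # Negative of Eq. (2) for the Clayton copula.
    return -np.sum(np.log(clayton_density(u1, u2, theta)))

theta_hat = minimize_scalar(neg_loglik, bounds=(1e-3, 20.0),
                            method="bounded").x
print(theta_hat)
```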

3 The Probabilistic Model for Classification

The proposed classifier explicitly considers the dependencies among variables. The dependence structure used in the design of the probabilistic classifier is a chain graphical model. For a d-dimensional continuous random vector \(\varvec{X}\), this model represents a probabilistic model with the following density:

$$\begin{aligned} f_{\text {chain}}({\mathbf {x}}) = f\left( x_{\alpha _{1}}\right) \prod _{i=2}^{d} f\left( x_{\alpha _{i}}|x_{\alpha _{(i-1)}}\right) , \end{aligned}$$
(3)

where \(\varvec{\alpha }=(\alpha _{1},\ldots ,\alpha _{d})\) is a permutation of the integers between 1 and d. Figure 1 shows an example of a chain graphical model for a three-dimensional vector. Notice that the permutation need not be unique, in the sense that different permutations can yield the same density values in (3). A numeric illustration of this factorization is given after Fig. 1.

Fig. 1. Joint distribution over three variables represented by a chain graphical model.
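To make Eq. (3) concrete, the sketch below evaluates a chain density for a trivariate Gaussian, where each conditional is itself normal. The covariance values and the chain order are illustrative only; the final print shows that the chain density only approximates the full joint density when the graph omits some dependencies.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

cov = np.array([[1.0, 0.8, 0.4],
                [0.8, 1.0, 0.5],
                [0.4, 0.5, 1.0]])
x = np.array([0.2, -0.5, 1.0])

# Chain with permutation alpha = (1, 2, 3): f(x1) f(x2|x1) f(x3|x2).
# For standard normals with correlation r: X_j | X_i = t ~ N(r t, 1 - r^2).
f_chain = norm.pdf(x[0])
f_chain *= norm.pdf(x[1], loc=0.8 * x[0], scale=np.sqrt(1 - 0.8**2))
f_chain *= norm.pdf(x[2], loc=0.5 * x[1], scale=np.sqrt(1 - 0.5**2))

f_joint = multivariate_normal(mean=np.zeros(3), cov=cov).pdf(x)
print(f_chain, f_joint)  # close but not equal: the 1-3 edge is dropped
```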

In practice the permutation \(\varvec{\alpha }\) is unknown and the chain graphical model must be learnt from data. One way of choosing the permutation \(\varvec{\alpha }\) is based on the Kullback-Leibler divergence (\(D_{KL}\)), an information measure between two distributions. It is non-negative for any pair of distributions and is zero if and only if the two distributions are identical, so it can be interpreted as a measure of dissimilarity between distributions. The goal is then to choose the permutation \(\varvec{\alpha }\) that minimizes the Kullback-Leibler divergence between the true data distribution \(f({\mathbf {x}})\) and the distribution associated with the chain model, \(f_{\text {chain}}({\mathbf {x}})\). For instance, the Kullback-Leibler divergence between the joint densities f and \(f_{\text {chain}}\) for a continuous random vector \({\mathbf {X}}=(X_{1},X_{2},X_{3})\) is given by:

$$\begin{aligned} D_{KL} \left( f \,||\, f_{\text {chain}} \right) &= E_{f} \left[ \text {log} \, \dfrac{f({\mathbf {x}})}{f_{\text {chain}}({\mathbf {x}})} \right] \\ &= -H({\mathbf {X}}) - \int \text {log} \left( f\left( x_{\alpha _{1}}\right) f\left( x_{\alpha _{2}}|x_{\alpha _{1}}\right) f\left( x_{\alpha _{3}}|x_{\alpha _{2}}\right) \right) f \, dx. \end{aligned}$$
(4)

The first term in Eq. (4), the negative entropy \(-H({\mathbf {X}})\) of the joint density \(f({\mathbf {x}})\), does not depend on the permutation \(\varvec{\alpha }\). By using copula theory and Eq. (1), the second term decomposes into marginal entropies and copula log-density integrals:

$$\begin{aligned} D_{KL} \left( f \,||\, f_{\text {chain}} \right) &= -H({\mathbf {X}}) + \sum _{i=1}^{d} H(X_{i}) - \int \text {log} \left( c\left( u_{\alpha _{1}}, u_{\alpha _{2}}; {\hat{\theta }}_{\alpha _{1}, \alpha _{2}} \right) \right) f \, dx \\ &\quad - \int \text {log} \left( c\left( u_{\alpha _{2}}, u_{\alpha _{3}}; {\hat{\theta }}_{\alpha _{2}, \alpha _{3}} \right) \right) f \, dx . \end{aligned}$$
(5)

The second term of Eq. (5), the sum of marginal entropies, does not depend on the permutation \(\varvec{\alpha }\) either. Therefore, minimizing Eq. (5) is equivalent to maximizing the two copula log-density integrals. Once a sample of size n from the joint density f is available, each of these integrals can be approximated by a Monte Carlo average:

$$\begin{aligned} \int \text {log} \left( c\left( u_{\alpha _{1}}, u_{\alpha _{2}}; {\hat{\theta }}_{\alpha _{1}, \alpha _{2}} \right) \right) f \, dx \approx \dfrac{1}{n} \sum _{i=1}^{n} \text {log} \left( c\left( u_{\alpha _{1},i}, u_{\alpha _{2},i}; {\hat{\theta }}_{\alpha _{1}, \alpha _{2}} \right) \right) . \end{aligned}$$
(6)

Through Eq. (6), \(D_{KL}\) is minimized by maximizing the sum of the log-likelihoods of the copula parameters. It is worth noting that the log-likelihood thus serves both to estimate the copula parameters and to select the appropriate permutation \(\varvec{\alpha }\). Finally, by means of copula theory, a chain graphical model for a three-dimensional vector has the density

$$\begin{aligned} f_{\text {chain}}({\mathbf {x}}) = f\left( x_{\alpha _{1}}\right) f\left( x_{\alpha _{2}}\right) f\left( x_{\alpha _{3}}\right) c \left( u_{\alpha _{1}},u_{\alpha _{2}}; {\hat{\theta }}_{\alpha _{1}, \alpha _{2}}\right) c \left( u_{\alpha _{2}},u_{\alpha _{3}}; {\hat{\theta }}_{\alpha _{2}, \alpha _{3}}\right) . \end{aligned}$$
(7)
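The structure-selection step can be sketched as follows: score every permutation of three variables by the Monte Carlo criterion of Eq. (6), using a Gaussian copula fitted by a simple grid-based ML and rank-based pseudo-observations. The copula family, the data and all helper names are our assumptions for illustration.

```python
from itertools import permutations

import numpy as np
from scipy.stats import norm

def gaussian_copula_loglik(u, v):
    """Maximized log-likelihood of a bivariate Gaussian copula (grid ML)."""
    x, y = norm.ppf(u), norm.ppf(v)
    best = -np.inf
    for rho in np.linspace(-0.99, 0.99, 199):
        ll = np.sum((2 * rho * x * y - rho**2 * (x**2 + y**2))
                    / (2 * (1 - rho**2)) - 0.5 * np.log(1 - rho**2))
        best = max(best, ll)
    return best

rng = np.random.default_rng(1)
data = rng.multivariate_normal(np.zeros(3),
                               [[1.0, 0.8, 0.1],
                                [0.8, 1.0, 0.5],
                                [0.1, 0.5, 1.0]], size=400)

# Rank-based pseudo copula observations in (0, 1).
u = (np.argsort(np.argsort(data, axis=0), axis=0) + 1) / (len(data) + 1)

# Keep the permutation whose adjacent pairs maximize the summed copula
# log-likelihood, i.e. minimize the D_KL criterion of Eq. (5).
best_perm = max(permutations(range(3)),
                key=lambda a: gaussian_copula_loglik(u[:, a[0]], u[:, a[1]])
                            + gaussian_copula_loglik(u[:, a[1]], u[:, a[2]]))
print(best_perm)  # expected to chain through variable 1 (the strongest pairs)
```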

3.1 The Probabilistic Classifier

Here, we combine bivariate copula functions with the chain graphical model in order to design a probabilistic classifier.

Bayes’ theorem states the following:

$$\begin{aligned} P(K=k|{\mathbf {X}}={\mathbf {x}}) = \dfrac{P({\mathbf {X}}={\mathbf {x}}|K=k) \times P(K=k)}{P({\mathbf {X}}={\mathbf {x}})} , \end{aligned}$$
(8)

where \(P(K=k|{\mathbf {X}}={\mathbf {x}})\) is the posterior probability, \(P({\mathbf {X}}={\mathbf {x}}|K=k)\) is the likelihood function, \(P(K=k)\) is the prior probability and \(P({\mathbf {X}}={\mathbf {x}})\) is the data probability.

Equation (8) is a standard tool in supervised classification. A probabilistic classifier can be designed by comparing the posterior probabilities that an object belongs to each class k given its features \({\mathbf {X}}\); the object is assigned to the class with the highest posterior probability. Since the data probability \(P({\mathbf {X}})\) is the same for every class, it does not need to be evaluated when comparing posterior probabilities. Furthermore, the prior probability P(K) can be replaced by a uniform distribution if the user does not have an informative prior.

The joint density in Eq. (7) can be used to model the likelihood function in Eq. (8). In this case, Bayes’ theorem can be written as:

$$\begin{aligned} P(K=k|{\mathbf {x}}) = \dfrac{ \prod _{j=1}^{2} c\left( F_{\alpha _{j}}(x_{\alpha _{j}}),F_{\alpha _{(j+1)}}(x_{\alpha _{(j+1)}}) \,|\, k; {\hat{\theta }}_{\alpha _{j}, \alpha _{(j+1)}}\right) \cdot \prod _{i=1}^{3} f_{i}(x_{i}|k) \cdot P(K=k)}{f(x_{1},x_{2},x_{3})} , \end{aligned}$$
(9)

where \(F_{i}\) are the marginal distribution functions and \(f_{i}\) are the marginal densities for each feature. The function c is a bivariate copula density taken from Table 1. As can be seen in Eq. (9), each class determines a likelihood function.
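A minimal sketch of the decision rule implied by Eq. (9): per class, score an observation by the log prior plus the log marginal densities (Gaussian-kernel estimates) plus the log copula densities along the chain, and return the arg max. The Frank copula, the fixed chain and the parameter values below are placeholders; in practice the \(\theta \)’s come from the ML step and the permutation from the structure search sketched above.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def frank_density(u, v, theta):
    num = theta * (1 - np.exp(-theta)) * np.exp(-theta * (u + v))
    den = ((np.exp(-theta) - 1)
           + (np.exp(-theta * u) - 1) * (np.exp(-theta * v) - 1))**2
    return num / den

def kde_cdf(kde, x):
    """Distribution function of a Gaussian-kernel marginal estimate."""
    bw = np.sqrt(kde.covariance[0, 0])
    return np.mean(norm.cdf(x, loc=kde.dataset[0], scale=bw))

def log_score(x, kdes, thetas, perm, log_prior):
    """Log numerator of Eq. (9); the denominator is class-independent."""
    u = [kde_cdf(kdes[j], x[j]) for j in perm]
    s = log_prior + sum(np.log(kdes[j](x[j])[0]) for j in perm)
    s += sum(np.log(frank_density(u[i], u[i + 1], thetas[i]))
             for i in range(len(u) - 1))
    return s

def classify(x, models):
    """models: {class k: (kdes, thetas, perm, log_prior)} fitted per class."""
    return max(models, key=lambda k: log_score(x, *models[k]))

# Toy usage with two synthetic classes and a fixed chain (0, 1, 2).
rng = np.random.default_rng(3)
models = {}
for k, loc in enumerate((0.0, 1.0)):
    Xk = rng.normal(loc, 1.0, size=(200, 3))
    kdes = [gaussian_kde(Xk[:, j]) for j in range(3)]
    models[k] = (kdes, [2.0, 2.0], (0, 1, 2), np.log(0.5))  # thetas: placeholders
print(classify(np.array([0.9, 1.1, 0.8]), models))
```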

4 Experiments

We use Eq. (9) and the copula functions from Table 1 to classify the pixels of 50 test images; hence, we test seven probabilistic classifiers. The image database was used in [2] and is available online [8]. It provides information about two classes: the foreground and the background. The training and test data are contained in the labelling-lasso files [8], whereas the correct classification is contained in the segmentation files. Figure 2 describes one image from the database. Although the database is intended for segmentation purposes, the aim of this work is to model dependencies in supervised classification, so only color features are considered for classifying pixels.

Fig. 2. (a) The color image. (b) The labelling-lasso image with the training data for background (dark gray), for foreground (white) and the test data (gray). (c) The correct classification with foreground (white) and background (black). (d) Classification made by independence. (e) Classification made by the Frank copula.

Three evaluation measures are used in this work: accuracy, sensitivity and specificity. These measures are defined in Fig. 3. Sensitivity and specificity give the percentage of correctly classified pixels for the foreground and the background class, respectively: we define the foreground as the positive class and the background as the negative class (see the sketch after Fig. 3).

Fig. 3. (a) A confusion matrix for binary classification, where tp, fp, fn, and tn are the true positive, false positive, false negative, and true negative counts. (b) Definitions of accuracy, sensitivity and specificity used in this work.
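The three measures of Fig. 3 are straightforward to compute from binary predictions; a possible implementation (foreground encoded as 1, background as 0) is:

```python
import numpy as np

def evaluate(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {"accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

print(evaluate(np.array([1, 1, 0, 0, 1, 0]),
               np.array([1, 0, 0, 1, 1, 0])))
```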

4.1 Numerical Results

Table 2 summarizes the values of the evaluation measures attained by the classifiers, grouped by the copula function used to model the dependencies.

Table 2. Descriptive results for all evaluation measures. The results are presented in percentages.

To properly compare the performance of the probabilistic classifiers, we conducted an ANOVA test comparing mean accuracy across classifiers. The test reports a statistically significant difference between the Clayton, Frank, Gaussian and Gumbel copula functions and the Independent copula (p-value < 0.05). The largest accuracy difference with respect to the independent copula is attained by the Frank copula.
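Such a comparison can be run, for instance, with scipy’s one-way ANOVA, given one accuracy value per image and per classifier. The arrays below are synthetic placeholders, not the paper’s data; the pairwise differences reported above would additionally require a post-hoc test.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Placeholder per-image accuracies (50 images per classifier).
acc_indep = rng.normal(0.80, 0.05, 50)
acc_frank = rng.normal(0.86, 0.04, 50)
acc_clayton = rng.normal(0.83, 0.05, 50)

stat, p_value = f_oneway(acc_indep, acc_frank, acc_clayton)
print(p_value < 0.05)  # True if at least one mean accuracy differs
```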

4.2 Discussion

According to Table 2, the classifier based on the Frank copula achieves the best accuracy. For sensitivity, the Frank and Gaussian copulas provide the best results. The best mean specificity is reached by the classifier based on the Clayton copula.

As can be seen, the average performance of the classifier is improved by the incorporation of copula functions. The lowest average performance corresponds to the classifier that assumes independence. Figure 4 shows how accuracy increases when dependencies are taken into account by the probabilistic classifier. The line in Fig. 4(a) is the identity function, so points above it correspond to a better accuracy than that achieved by the classifier based on the independent copula. To give a better insight, Fig. 4(b) shows the gain in accuracy of the copula-based classifiers with respect to the naive classifier (independent copula).

Fig. 4. (a) Scatterplot of the accuracy values of the classifier based on the independence assumption (horizontal axis) against the classifiers based on copula functions (vertical axis). (b) The gain in accuracy from using copula functions.

Table 2 also reports the standard deviation of each evaluation measure. For accuracy, the standard deviations indicate that the classifier based on the Frank copula is more consistent than the other classifiers.

Figure 2 shows the results for one of the 50 test images. Panel (d) shows the image classified under the independence assumption, and panel (e) shows the same image classified with the Frank copula; the improvement provided by the Frank copula is visually apparent. For this image, the color data for each class are shown in Fig. 5. The dependence structure clearly does not correspond to that of a bivariate Gaussian distribution, and according to the numerical results the Frank copula is the best model for this kind of dependence.

Fig. 5. The first row shows the scatterplots between (a) red and green, (b) red and blue, and (c) green and blue colors for the foreground class. The second row shows the corresponding scatterplots for the background class. (Color figure online)

5 Conclusions

In this paper we have compared the performance of several copula-based probabilistic classifiers. The results show that the dependence among features provides important information for supervised classification. For the images used in this work, the Gumbel copula performs very well in most cases. One advantage of using a chain graphical model is that it detects the most important dependencies among variables, which can be valuable in applications where associations among variables give additional knowledge of the problem. Although accuracy is increased by the classifiers based on copula functions, the selection of the copula function has relevant consequences for the performance of the classifier. For instance, in Fig. 4, a few classifiers do not improve on the performance achieved by the classifier based on the independent copula. This suggests that more experiments are needed in order to select an adequate copula function for a given problem. Moreover, as future work, the copula-based classifier should be tested on other datasets and compared with other classifiers in order to gain better insight into its benefits and limitations.