
1 Introduction

In statistical pattern recognition, an object is conventionally represented as a vector whose entries correspond to numerical values of its features. In such a representation, objects are points in a vector space: the well-known feature space. However, this conventional representation is often inconvenient, particularly when the extraction of features from symbolic data (such as graphs and grammars) or from raw sensor measurements (such as signals and images) is difficult, or when it is not even clear how to do it in the first place. As an alternative, Pekalska and Duin [13] proposed measuring dissimilarities between pairs of objects and organizing them as vectors, such that each object is represented as a point in the so-called dissimilarity space [7], where any classifier can be trained and applied. The dissimilarity representation belongs to the field of (dis)similarity pattern recognition, which has been actively researched in recent years [14, 15].

In many pattern classification problems it is mandatory to normalize the feature space, i.e. to make comparable the ranges of the different features, which may derive from different measures/sensors: the typical approach is to apply a linear scaling to the axes of the vector space. This operation guarantees that the classifier decision takes values in all directions equally into account, once the unwanted influence of their original dynamic ranges has been removed. In the dissimilarity space, in contrast, range differences among the directions tend to be less pronounced because all the features are of the same nature: they are all distances to the objects of a reference group, formally called the representation set. For this reason linear scaling is less crucial. However, other more complex scaling operations, such as those involving non linear transformations, can be very useful and lead to improvements in classification; this has also been suggested for classical feature spaces [1, 4, 5, 11, 17]. For dissimilarity spaces, Duin et al. [6] found that the non-linear scaling of given dissimilarities by a power transformation appears to be useful for improving nearest neighbor performance in the dissimilarity space. They studied its behavior in terms of classification error and found that raising dissimilarities to powers less than 1 often contributes to such an improvement. When trying to explain the phenomenon, they suggested that the benefits derive from three properties of a power transformation with power less than 1: (i) objects tend to become equally distant from each other, (ii) distances to outliers are shrunk, and (iii) the neighborhood of each object is enlarged by emphasizing distances between close objects.

In their study, as well as in the related work on classical feature spaces cited above [4, 5, 11, 17], the estimation of the proper power parameter remains a crucial open issue; typically such a parameter is set by hand or found by an exhaustive search, and in [6] it is estimated via computationally prohibitive cross validation. In this paper we propose a novel unsupervised criterion that can guide the selection of the parameter of the power transformation: as the power parameter approaches zero, this criterion seeks a compromise between the reduction of the dispersion of the data and the increase of the intrinsic dimensionality of the resulting dissimilarity space (if too small a power is applied, all points converge around 1). The criterion is unsupervised, since it does not require labels, and computationally more feasible than cross validation, since it does not require repeated training of classifiers. A thorough experimental evaluation on several different datasets shows that by applying the power transformation with the best parameter according to the proposed criterion we obtain accuracies that are (i) almost always significantly better than those obtained in the space without the preprocessing and (ii) often equivalent to or better than those obtained by the computationally expensive cross validation procedure.

The rest of the paper is organized as follows: in Sect. 2 we briefly summarize the dissimilarity space and the non linear scaling by power transformation; then, in Sect. 3 we detail the proposed approach; the experimental evaluation is presented in Sect. 4; finally, in Sect. 5, conclusions are drawn and future perspectives are envisaged.

2 Background

2.1 The Dissimilarity Space

The vector arrangement of the dissimilarities computed between a particular object x and the objects of a set \(\mathcal {R}\) allows representing x as a point in a vector space. Such a space is called the dissimilarity space, having in principle as many dimensions as the cardinality of \(\mathcal {R}\), which is known as the representation set. For a set of training objects \(\mathcal {T}\) and a representation set \(\mathcal {R}\), these vectors constitute a so-called dissimilarity representation in the form of a dissimilarity matrix \(\mathbf {D}(\mathcal {T}, \mathcal {R})\). The representation set is often the same as the training set, so \(\mathbf {D}(\mathcal {T}, \mathcal {R}) = \mathbf {D}(\mathcal {T}, \mathcal {T})\). For notational simplicity, hereafter we simply use \(\mathbf {D}\) to refer to the square dissimilarity matrix \(\mathbf {D}(\mathcal {T}, \mathcal {T})\).
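To make the construction concrete, the following minimal sketch (in Python with NumPy; the paper works with given dissimilarities of any kind, so the raw feature vectors and the choice of the Euclidean distance as base dissimilarity are illustrative assumptions of ours) builds the matrix \(\mathbf {D}(\mathcal {T}, \mathcal {R})\), whose rows are the points of the dissimilarity space.

```python
import numpy as np

def dissimilarity_representation(T, R):
    """Represent each object in T as its vector of dissimilarities to R.

    T: (n, p) array of raw feature vectors (stand-in for any object set).
    R: (r, p) array of representation objects.
    Returns the (n, r) matrix D(T, R); row i is the point representing
    object i in the r-dimensional dissimilarity space.
    """
    diff = T[:, None, :] - R[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))  # Euclidean base dissimilarity (assumed)

# Representation set equal to the training set gives the square matrix D(T, T).
T = np.random.default_rng(0).normal(size=(10, 3))
D = dissimilarity_representation(T, T)
print(D.shape)  # (10, 10): one dimension per representation object
```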

Several studies [7, 13] have shown the possibilities of training classifiers in the dissimilarity space, such that a test object represented in terms of its dissimilarities to \(\mathcal {R}\) can be classified by a more sophisticated rule than the nearest neighbor classifier on the given dissimilarities (i.e. template matching, denoted as 1-NN). The classifier in the dissimilarity space can even be the same nearest neighbor rule but now based on distances between points in the dissimilarity space; here we denote that case as 1-NND in order to distinguish it from template matching.
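The distinction between 1-NN (template matching on the given dissimilarities) and 1-NND (nearest neighbor on distances between points of the dissimilarity space) can be sketched as follows. This is an illustrative implementation of ours, not the code used in the cited studies, and the argument names are assumptions.

```python
import numpy as np

def nn_template_matching(D_test_train, y_train):
    """1-NN on the given dissimilarities: each test object takes the label
    of the training object with the smallest given dissimilarity."""
    return y_train[np.argmin(D_test_train, axis=1)]

def nn_in_dissimilarity_space(D_train, D_test_train, y_train):
    """1-NND: rows of the dissimilarity matrices are treated as points, and
    1-NN is applied on the Euclidean distances between those points."""
    diff = D_test_train[:, None, :] - D_train[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    return y_train[np.argmin(dists, axis=1)]
```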

2.2 Non Linear Scaling

Raising all dissimilarities to the same power is a simple and straightforward non linear scaling. For a dissimilarity matrix \(\mathbf {D}\), such a transformation can be written as follows:

$$\begin{aligned} \mathbf {D}^{\star {\rho }} = (d_{ij}^{\rho }), \qquad \rho > 0 \end{aligned}$$
(1)

where each entry, \(d_{ij} = d(x_i,x_j)\), of the matrix denotes the dissimilarity between two objects \(x_i\) and \(x_j\), and \(^\star \) denotes the entrywise (Hadamard) power function [9]. There exists an optimal value of \(\rho \) that provides the best 1-NND classification performance; we denote it as \(\rho ^{*}\). In most cases, \(\rho ^{*}\) is lower than 1. This is reasonable, since with \(\rho <1\) we have a concave function that raises low values and shrinks high values: for dissimilarities, this may have a good impact on the representation in the dissimilarity space, since it reduces the impact of outliers (large distances are reduced) and increases the importance of the neighborhood (small distances are increased).
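As a small numerical illustration (ours, not taken from the paper), the entrywise power with \(\rho < 1\) raises dissimilarities below 1 and shrinks those above 1:

```python
import numpy as np

D = np.array([[0.0, 0.2, 3.0],
              [0.2, 0.0, 2.5],
              [3.0, 2.5, 0.0]])

rho = 0.5
D_pow = D ** rho   # entrywise (Hadamard) power of Eq. (1)
# 0.2 -> ~0.45 (small distances are raised), 3.0 -> ~1.73 (large ones are shrunk):
# neighborhoods are emphasized while outlying distances are damped
print(D_pow)
```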

Therefore, we only search for an estimate \(\widehat{\rho ^{*}}\) in the interval (0, 1]. Below we describe the existing method for estimating \(\rho ^{*}\) by cross validation; in Sect. 3 we explain our proposed estimation via the optimization of an unsupervised criterion.

Optimization via Cross Validation. A typical procedure to optimize the value of a parameter is to search over the parameter domain for the lowest cross validation classification error. This is the strategy used by Duin et al. [6] for finding the best parameter, which in this case we call \(\widehat{\rho ^{*}}\!\!_{cv}\), defined as follows:

$$\begin{aligned} \widehat{\rho ^{*}}\!\!_{cv} = \mathop {{{\mathrm{arg\,min}}}}\limits _{{\rho \in (0,1]}} \epsilon _{1-NND}(\mathbf {D}^{\star \rho }) \end{aligned}$$
(2)

where \(\epsilon _{1-NND}\) denotes the leave-one-out cross validation error of 1-NND. Even though the experiments in [6] suggest that this optimization yields good classification performance, it may become computationally prohibitive for large datasets. Moreover, such a criterion does not help to understand what is happening with the non linear scaling, i.e. it does not explain the topological effect of the parameter value on the space.
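A possible implementation of the cross-validation estimate of Eq. (2) is sketched below. This is our code, under the simplifying assumption that the representation set stays fixed and each object is only excluded as its own nearest neighbor in the leave-one-out loop.

```python
import numpy as np

def loo_error_1nnd(D_pow, y):
    """Leave-one-out error of 1-NND: 1-NN on the squared Euclidean distances
    between rows of D_pow (the dissimilarity-space points); squared distances
    yield the same nearest neighbor as true Euclidean distances."""
    sq = (D_pow ** 2).sum(axis=1)
    dists2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * D_pow @ D_pow.T, 0.0)
    np.fill_diagonal(dists2, np.inf)   # never match an object to itself
    pred = y[np.argmin(dists2, axis=1)]
    return float(np.mean(pred != y))

def rho_by_cross_validation(D, y, rhos):
    """Grid search for the exponent with the lowest LOO 1-NND error (Eq. 2)."""
    errors = [loo_error_1nnd(D ** rho, y) for rho in rhos]
    return rhos[int(np.argmin(errors))]
```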

3 The Proposed Criterion

As introduced before, applying a power transformation with \(\rho <1\) has a two-fold effect on data in the dissimilarity space. First, the dispersion of the values in each dimension of the space is shrunk (small distances are raised and large distances are reduced); second, the neighborhood of each point is strongly emphasized (small distances are raised). This behaviour becomes more and more extreme as \(\rho \) approaches zero. Up to some extent these effects are desirable, in order to reduce the impact of outliers (distances to far away points are reduced) and to better characterize the neighborhood of each object (distances to nearby points are raised); however, beyond a certain point such positive effects are lost, since all points tend to become equally spaced, thus losing all the information contained in the original dissimilarity matrix. This effect can be monitored by looking at the intrinsic dimensionality of the data, which increases when points tend to be more equally spaced. Therefore, a criterion that optimizes a trade-off between these two effects (reduction of dispersion and increase of the intrinsic dimensionality) seems a reasonable way to find \(\widehat{\rho ^{*}}\).

Among the available dispersion measures, the quartile coefficient of dispersion (qcd) [10, p. 15] is a robust statistical estimator that gives a scale-free measure of data spread. It is given as:

$$\begin{aligned} qcd = \frac{Q_3 - Q_1}{Q_3 + Q_1}, \end{aligned}$$
(3)

where \(Q_3\) and \(Q_1\) are the third and first quartiles, respectively. In our case, they are computed as follows: for each column (dimension) of \(\mathbf {D}^{\star \rho }\), we find the median of the upper half of the values (which is \(Q_3\), also called the 75th percentile) and the median of the lower half of them (which is \(Q_1\), also called the 25th percentile).
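A direct computation of the per-dimension qcd could look as follows (a sketch of ours; np.percentile's default interpolation is used as a stand-in for the median-of-halves definition above, which can differ slightly):

```python
import numpy as np

def qcd_per_column(D_pow):
    """Quartile coefficient of dispersion (Eq. 3) of every column
    (dimension) of the transformed dissimilarity matrix."""
    q1 = np.percentile(D_pow, 25, axis=0)   # first quartile per dimension
    q3 = np.percentile(D_pow, 75, axis=0)   # third quartile per dimension
    return (q3 - q1) / (q3 + q1)
```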

Similarly, there are many methods to estimate the intrinsic dimensionality (id) of a dataset, see for instance the reviews by Camastra [2, 3]. We have adopted the one described in [13, p. 313], which computes the estimate directly from dissimilarity data:

$$\begin{aligned} \widehat{id}(\mathbf {D}) = \left\lceil 2\frac{(\mathbf {1}^{\top }\mathbf {D}^{\star 2}\mathbf {1})^2}{n(n - 1)\mathbf {1}^{\top }\mathbf {D}^{\star 4}\mathbf {1} - (\mathbf {1}^{\top }\mathbf {D}^{\star 2}\mathbf {1})^2} \right\rceil \end{aligned}$$
(4)

where \(\mathbf {D}^{\star 2} = (d_{ij}^2)\), \(\mathbf {D}^{\star 4} = (d_{ij}^4)\) and n is the number of columns (and rows) of the square matrix \(\mathbf {D}\).
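Equation (4) translates almost literally into code; the sketch below is ours:

```python
import numpy as np

def intrinsic_dimension(D):
    """Intrinsic dimensionality estimate of Eq. (4), computed directly
    from a square dissimilarity matrix D."""
    n = D.shape[0]
    s2 = np.sum(D ** 2)   # 1^T D^(*2) 1
    s4 = np.sum(D ** 4)   # 1^T D^(*4) 1
    return int(np.ceil(2.0 * s2 ** 2 / (n * (n - 1) * s4 - s2 ** 2)))
```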

Given these definitions, our criterion determines the best parameter (which we call \(\widehat{\rho ^{*}}\!\!_{nlm}\)) by optimizing the compromise between (i) the average, or rather its robust estimate the median, of the dispersion (3) per dimension and (ii) the intrinsic dimension (4) computed for the pairwise distances in the dissimilarity space, that is, for the matrix \(\mathbf {D}_{DS}\) of Euclidean distances between pairs of points in the dissimilarity space. The final criterion can be written as:

$$\begin{aligned} \widehat{\rho ^{*}}\!\!_{nlm} = \mathop {{{\mathrm{arg\,min}}}}\limits _{{\rho \in (0,1]}} \left[ \mathop {{{\mathrm{median}}}}\limits _{{1 \le i \le n}}\left( qcd_i \right) \times \widehat{id}\left( \mathbf {D}_{DS}^{\star \rho }\right) \right] \end{aligned}$$
(5)

Notice that, even though there are several alternatives to define a compromise between two variables, we have chosen to minimize the product between them. A multiplicative criterion has also been adopted in other scenarios [8, 12] where the two variables of interest are related in a non-trivial way.
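Putting the pieces together, the criterion of Eq. (5) can be evaluated on a grid of candidate exponents as sketched below (our code, reusing qcd_per_column and intrinsic_dimension from the previous sketches; \(\mathbf {D}_{DS}\) is computed here as the Euclidean distance matrix between the rows of \(\mathbf {D}^{\star \rho }\)):

```python
import numpy as np

def criterion_value(D, rho):
    """Value of the unsupervised criterion (Eq. 5) for a single exponent."""
    D_pow = D ** rho                                  # transformed dissimilarities
    med_qcd = np.median(qcd_per_column(D_pow))        # median dispersion per dimension
    # Euclidean distances between the points (rows) of the dissimilarity space
    sq = (D_pow ** 2).sum(axis=1)
    D_ds = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * D_pow @ D_pow.T, 0.0))
    return med_qcd * intrinsic_dimension(D_ds)

def rho_by_criterion(D, rhos):
    """Estimate rho*_nlm by minimizing the criterion over the candidates."""
    values = [criterion_value(D, rho) for rho in rhos]
    return rhos[int(np.argmin(values))]
```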

3.1 Inductive and Transductive Versions

The criterion introduced in the previous section is completely unsupervised: exploiting this property, we investigate its usefulness in two different flavours, which we call “Version 1” and “Version 2”, respectively:

  1. Version 1 (\(\widehat{\rho ^{*}}\!\!_{nlm1}\)): the best parameter is the one optimizing the proposed criterion on the training set. This represents classical learning, also known as inductive inference [16, p. 577], where the criterion is computed using only the training objects.

  2. Version 2 (\(\widehat{\rho ^{*}}\!\!_{nlm2}\)): the best parameter is the one optimizing the proposed criterion on the whole dataset, obviously ignoring the labels. This represents the so-called transductive learning [18], where all the available objects are used: the training objects, for which we can employ the labels, and the test objects, for which labels are unknown. Since the proposed criterion does not take labels into account, transductive learning can be applied (see the sketch after this list).
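In code, the two versions differ only in which block of the dissimilarity matrix the criterion is evaluated on; a minimal sketch of ours, reusing rho_by_criterion from the previous section:

```python
import numpy as np

def rho_inductive(D_full, train_idx, rhos):
    """Version 1: optimize the criterion on the training block only."""
    D_train = D_full[np.ix_(train_idx, train_idx)]
    return rho_by_criterion(D_train, rhos)

def rho_transductive(D_full, rhos):
    """Version 2: optimize the criterion on the whole (unlabeled) dataset."""
    return rho_by_criterion(D_full, rhos)
```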

Table 1. Datasets employed for empirical evaluation.

4 Experimental Results

The proposed approach has been tested on a set of public domain datasets (also employed in [6]); see Table 1. Most of them are derived from real objects (images, text, protein sequences). The Chickenpieces dataset consists of 44 dissimilarity matrices; in the tables, the average characteristics are shown. In our empirical evaluation we compared the errors made by the Nearest Neighbor rule (1-NND errors) in four different versions of the dissimilarity space:

  1. Original: this is the unprocessed case (no transformation is applied), i.e. the dissimilarity space is built using the original dissimilarity matrix \(\mathbf {D}\).

  2. NL-Cross Val: the dissimilarity space is built from \(\mathbf {D}^{\star {\widehat{\rho ^{*}}\!\!_{cv}}}\), i.e. after applying a non linear transformation whose parameter is chosen by optimizing the LOO error on the training set. As said before, this is the criterion proposed in [6].

  3. NL-Disp (ver. 1): the dissimilarity space is built from \(\mathbf {D}^{\star {\widehat{\rho ^{*}}\!\!_{nlm1}}}\), i.e. after applying a non linear transformation whose parameter is chosen by optimizing the proposed criterion on the training set.

  4. NL-Disp (ver. 2): the dissimilarity space is built from \(\mathbf {D}^{\star {\widehat{\rho ^{*}}\!\!_{nlm2}}}\), i.e. after applying a non linear transformation whose parameter is chosen by optimizing the proposed criterion on the whole dataset (in a transductive way, see the previous section).

Table 2. 1-NND errors for the different datasets. In brackets we report the standard errors of the mean.

Errors have been computed using averaged hold-out cross validation, i.e. by using half of the dataset for training (and representation) and the remaining half for testing. In order to ensure a robust estimation of the errors, this procedure has been repeated 200 times and the results are averaged (a sketch of one repetition of this protocol is given after the list of symbols below). For criteria 2–4, the best value of the exponent has been chosen in the grid \(1.25^{-15}, 1.25^{-14.5}, 1.25^{-14}, \ldots , 1\). Averaged errors, together with standard errors of the mean, are reported in Table 2. In order to get a more direct view of the results, we report in Table 3 an improvement/degradation table, resulting from several pairwise statistical tests. In particular, we compared the errors obtained with the proposed criterion (NL-Disp in both versions v1 and v2) with those obtained without transforming the space (Original) and with the parameter chosen via cross validation (NL-Cross Val). As statistical test we employed the paired t-test, comparing the 200 errors obtained over the 200 repetitions of the cross validation. In the table, we use five different symbols:

Table 3. Pairwise statistical comparisons: “\(\uparrow \)” indicates a statistically significant improvement (results with our criterion are better), “\(\downarrow \)” a statistically significant degradation (results with our criterion are worse), whereas “\(\approx \)” indicates that the two methods are equivalent (i.e. there is no statistically significant difference).
  • the symbols “\(\uparrow \)” and “\(\uparrow \uparrow \)” indicate a statistically significant improvement (results with our criterion are better): the former indicates that the test passed with a p-value less than 0.05 but greater than 0.001, whereas in the latter case the p-value was less than 0.001;

  • “\(\downarrow \)” and “\(\downarrow \downarrow \)” indicate a statistically significant degradation (results with our criterion are worse); also in this case the former indicates that the test passed with a p-value less than 0.05 but greater than 0.001, whereas in the latter case the p-value was less than 0.001;

  • “\(\approx \)” indicates that the two methods are equivalent (i.e. there is no statistically significant difference).
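For concreteness, one repetition of the evaluation protocol described above might look as follows (our sketch, reusing rho_by_criterion from Sect. 3; only the NL-Disp (ver. 1) variant is shown, and its test error is averaged over 200 such repetitions):

```python
import numpy as np

rng = np.random.default_rng(0)
rhos = 1.25 ** np.arange(-15.0, 0.5, 0.5)   # grid 1.25^-15, 1.25^-14.5, ..., 1

def one_repetition(D, y):
    """One 50/50 hold-out split: tune rho on the training block with the
    proposed criterion, then return the 1-NND test error."""
    n = len(y)
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2:]
    rho = rho_by_criterion(D[np.ix_(tr, tr)], rhos)        # NL-Disp (ver. 1)
    D_tr = D[np.ix_(tr, tr)] ** rho                        # training points
    D_te = D[np.ix_(te, tr)] ** rho                        # test points, same axes
    diff = D_te[:, None, :] - D_tr[None, :, :]
    pred = y[tr][np.argmin((diff ** 2).sum(axis=2), axis=1)]
    return float(np.mean(pred != y[te]))
```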

Several observations can be derived from the table. First, as expected, the transductive version (version 2) of our criterion is almost always slightly better than version 1; this interesting result is possible thanks to the unsupervised nature of the proposed criterion. Reasonably, this does not hold when the dataset is large enough (as for Zongker, Polydish57 and Polydism57). Second, non linearly preprocessing the dissimilarity matrix with the parameter chosen by our criterion almost always results in a statistically significant improvement in classification performance with respect to the original space. The only exceptions are the Protein and Polydism57 datasets, for which, however, an almost zero error was already achieved in the original space, leaving little room for improvement. This holds consistently for both version 1 and version 2. Finally, the proposed criterion also compares reasonably well with the cross validation approach: considering version 2, in 11 cases out of 17 our results are better or equivalent (in 8 cases they are significantly better), whereas only in 6 cases they are worse. In these latter cases, however, the degradations are very small: \(\approx \)0.01 for CoilYork, WoodyPlants50, Polydism57 and Polydish57, \(\approx \)0.004 for DelftGestures, and \(\approx \)0.001 for ChickenPieces. We believe these are really promising results, also considering that our criterion is completely unsupervised.

5 Conclusions

In this paper a novel unsupervised criterion to tune the parameter of the power transformation (non-linear scaling) of dissimilarities has been proposed. The new tuning criterion is based on a trade-off between the median dispersion per dimension in the dissimilarity space (measured in terms of the quartile coefficient of dispersion) and the intrinsic dimension of the resulting dissimilarity space. The idea behind our approach is that a good performance of the nearest neighbor classifier in the dissimilarity space is associated with such a compromise between how much we shrink the data and how much we increase the intrinsic dimensionality: the shrinking is desirable because, by reducing the range, we can potentially reduce the influence of outliers, since high distances (i.e. the distances to possible outliers) are reduced much more than short distances.

The proposed criterion is unsupervised and can therefore even be applied in a transductive learning setting. Empirical results on many different datasets partially support our intuitions. As future work, we would like to study the properties of the proposed criterion also in classical feature-based problems [4, 5, 11, 17]. Moreover, we aim to provide a more formal, theoretical or numerical, explanation: one possibility is to bridge our experimental evidence with the theory on Hadamard powers [9].