Abstract
We present Deep Stochastic Neighbor Compression (DSNC), a framework to compress training data for instance-based methods (such as k-nearest neighbors). We accomplish this by inferring a smaller set of pseudo-inputs in a new feature space learned by a deep neural network. Our framework can equivalently be seen as jointly learning a nonlinear distance metric (induced by the deep feature space) and learning a compressed version of the training data. In particular, compressing the data in a deep feature space makes DSNC robust against label noise and issues such as within-class multi-modal distributions. This leads to DSNC yielding better accuracies and faster predictions at test time, as compared to other competing methods. We conduct comprehensive empirical evaluations, on both quantitative and qualitative tasks, and on several benchmark datasets, to show its effectiveness as compared to several baselines.
1 Introduction
In machine learning problems there are situations for which the massive data scale renders learning algorithms infeasible to run in a reasonable amount of time. One solution is to first summarize the data in the form of a small set of representative data points that best characterize and represent the original data, and then run the original algorithm on this subset of the data. This may be desirable due to the requirement of making a fast prediction at test time, in problems where the predictions depend on the entire training data, e.g., k-nearest neighbors (kNN) classification [4, 8] or kernel methods such as SVMs [21]. For example, in traditional kNN classification, the prediction cost for each test example scales linearly in the number of training examples, which can be expensive if the number of training examples is large. Traditional approaches to speed up such methods usually rely on cleverly designed data structures or select a compact subset of the original data (e.g., via subsampling [4]). Although such methods may reduce the storage requirements and/or prediction time, the performance tends to suffer, especially if the original data is high-dimensional and/or noisy.
Recently [24] introduced Stochastic Neighbor Compression (SNC), which learns a set of pseudo-inputs for kNN classification by minimizing a stochastic 1-nearest neighbor classification error on the training data. Compared to the data sub-sampling approaches, SNC achieves impressive improvements in test accuracy when using these pseudo-inputs as the new training set. However, since SNC performs data compression in the original data space (or in a linearly transformed lower-dimensional space), it may perform poorly when the data in the original space are highly non-separable and noisy.
Motivated by this, we present Deep Stochastic Neighbor Compression (DSNC), a new framework to jointly perform data summarization akin to methods like SNC, while also learning a nonlinear feature representation of the data via a deep learning architecture. Our framework is based on optimizing an objective function that is designed to learn nonlinear transformations that preserve the neighborhood structure in the data (based on label information), while simultaneously learning a small set of pseudo-inputs that summarize the entire data. Note that, due to the neighborhood preserving property, our framework can also be viewed as performing nonlinear (deep) distance metric learning [22], while also learning a summarized version of the original data. The data summarization aspect also makes DSNC much faster than other metric learning based approaches, which need all the training data. In DSNC, data summarization and feature learning are performed jointly through backpropagation [31] using stochastic gradient descent, making our framework readily scalable to large data sets. Moreover, our framework is also more general than standard feedforward neural networks, which perform simultaneous feature learning and classification but are not designed to learn a summary of the data, which may be useful in its own right.
In our comprehensive empirical studies, DSNC achieves superior classification accuracies on the seven datasets we used in the experiments, outperforming SNC by a significant margin. For example, with DSNC, 1-NN is able to achieve \(0.67\,\%\) test error on MNIST with only ten compressed data samples (one per class) on a 20-dimensional feature space, compared to \(7.71\,\%\) for SNC. We also report qualitative experiments (via visualization) showing that DSNC is effective in learning a good summary of the data.
2 Background
Throughout this paper, we denote vectors as bold, lower-case letters, and matrices as bold, upper-case letters. \(\Vert \cdot \Vert \) applied to a vector denotes the standard vector norm, \([\mathbf {X}]_{ij}\) means the (i, j)-th element of matrix \(\mathbf {X}\). We denote the training data \(\mathbf {X}=\{\mathbf {x}_1, ..., \mathbf {x}_N\}\), where \(\mathbf {X}\in \mathbb {R}^{D\times N}\) are N observed data samples of dimensionality D with corresponding labels \(\mathbf {Y}=\{y_1, ..., y_N\} \in \mathcal {Y}^N\), with \(\mathcal {Y}\) as a discrete set of possible labels.
To motivate our proposed framework DSNC (described in Sect. 3), we first provide an overview of Neighborhood Components Analysis (NCA) [16, 32] and Stochastic Neighbor Compression (SNC) [24], which our proposed framework builds on.
2.1 Neighborhood Components Analysis
Neighborhood Components Analysis (NCA) [16] is a distance metric learning method that learns a mapping \(f(\cdot |\mathbf {W})\) with parameters \(\mathbf {W}\) to optimize the k-nearest-neighbors classification objective. The optimization is based on preserving the squared Euclidean distance \(d_{ij}=||f(\mathbf {x}_i) - f(\mathbf {x}_j)||^2\) in the transformed space for \(\mathbf {x}_i\) and \(\mathbf {x}_j\), based on their neighborhood relationship in the original space. Specifically, soft neighbor assignments are used in NCA to directly optimize the mapping f for \(k\)NN classification performance. The probability \(p_{ij}\) that \(\mathbf {x}_i\) is assigned to \(\mathbf {x}_j\) as its stochastic nearest-neighbor is modeled with a softmax over distances between \(\mathbf {x}_i\) and the other training samples, i.e., \(p_{ij} = \frac{\exp (-d_{ij})}{\sum _{k:k\ne i}\exp (-d_{ik})}\). The objective of NCA is to maximize the expected number of correctly classified points, expressed here as a log-minimization problem: \(\hat{\mathbf {W}} = \arg \min _{\mathbf {W}} -\sum _{i=1}^{N}\log (p_i)\), where \(p_i\) is the probability that the mapped sample \(f(\mathbf {x}_i|\mathbf {W})\) is correctly classified with label \(y_i\), i.e., \(p_i = \sum _{j:y_i=y_j}p_{ij}\). Although NCA can learn a distance metric adaptively from data, the entire training data still needs to be stored, making it expensive in both computation and storage at test time. To extend NCA with nonlinear transformations, [32] defines \(f(\cdot | \mathbf {W})\) to be a feedforward neural network parameterized by weights \(\mathbf {W}\).
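As a rough illustration of the soft neighbor assignment, the following sketch computes \(p_i\) and the NCA loss for points that are assumed to be already mapped by f. The function names and the toy data are illustrative assumptions and are not taken from [16].

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nca_correct_probs(F, y):
    """Return p_i for each mapped point F[i] = f(x_i) with label y[i]."""
    n = len(F)
    probs = []
    for i in range(n):
        # Softmax weights over negative squared distances to all other points (k != i).
        w = [math.exp(-sq_dist(F[i], F[k])) if k != i else 0.0 for k in range(n)]
        total = sum(w)
        # p_i sums p_ij over the neighbors j that share the label of point i.
        same = sum(w[j] for j in range(n) if j != i and y[j] == y[i])
        probs.append(same / total)
    return probs

def nca_loss(F, y):
    """Negative log-likelihood -sum_i log(p_i), the quantity NCA minimizes."""
    return -sum(math.log(p) for p in nca_correct_probs(F, y))

# Toy data (assumption): two well-separated classes in a 2-D mapped space.
F = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]]
y = [0, 0, 1, 1]
print(nca_correct_probs(F, y))
print(nca_loss(F, y))
```

Because every training point appears in the softmax, evaluating this objective (and predicting at test time) touches all N samples, which is the cost that a compressed reference set is meant to avoid.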
2.2 Stochastic Neighbor Compression (SNC)
Stochastic Neighbor Compression (SNC) [24] extends NCA by learning a compressed \(k\)NN training set through optimization of a soft neighborhood objective. The goal in SNC is to find a set of \(m\!\ll \!N\) compressed samples \(\mathbf {Z}=[\mathbf {z}_1, ..., \mathbf {z}_m]\) with labels \(\hat{\mathbf {Y}} = [\hat{y_1}, ..., \hat{y_m}]\), to best approximate the \(k\)NN decision rule on the original set of training samples \(\mathbf {X}\) and labels \(\mathbf {Y}\). Unlike in NCA, a compressed set \(\mathbf {Z}\) needs to be learned from the whole data. The objective is to maximize the stochastic nearest-neighbor accuracy with respect to \(\mathbf {Z}\), i.e., \(\hat{\mathbf {Z}} = \arg \min _{\mathbf {Z}} -\sum _{i=1}^{N}\log (p_i)\), where the probability of a correct assignment between a training sample \(\mathbf {x}_i\) and the compressed neighbors \(\mathbf {z}_j\) is defined as \(p_i = \sum _{j:y_i=\hat{y_j}} \frac{\exp (-\gamma ^2||\mathbf {x}_i - \mathbf {z}_j ||^2)}{\sum _{k=1}^m\exp (-\gamma ^2||\mathbf {x}_i - \mathbf {z}_k||^2)}\), where \(\gamma \) is the width of the Gaussian kernel. Given such probabilities, the objective of SNC is constructed as in the case of NCA and is optimized w.r.t. the m pseudo-inputs \(\mathbf {Z}\). In [24], a linear metric learning extension of this approach was also considered, which defines \(p_i = \sum _{j:y_i=\hat{y_j}} \frac{\exp (-\gamma ^2||\mathbf A (\mathbf {x}_i - \mathbf {z}_j) ||^2)}{\sum _{k=1}^m\exp (-\gamma ^2||\mathbf A (\mathbf {x}_i - \mathbf {z}_k)||^2)}\), in which the pseudo-inputs are learned in the linearly transformed space. However, in the case of noisy and highly non-separable data sets, the linear transformation may not be able to learn a good set of pseudo-inputs. Our proposed framework, on the other hand, is designed to learn these pseudo-inputs while simultaneously learning a nonlinear feature representation for them.
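The SNC probability of a correct assignment can be sketched as below, for the variant without a learned metric. The helper names, the default value of gamma and the toy data are assumptions made for illustration; the optimization over \(\mathbf {Z}\) itself is not shown.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def snc_correct_prob(x, y, Z, Z_labels, gamma=1.0):
    """p_i: probability that x (with label y) picks a pseudo-input of the same label."""
    w = [math.exp(-gamma ** 2 * sq_dist(x, z)) for z in Z]
    same = sum(wj for wj, lab in zip(w, Z_labels) if lab == y)
    return same / sum(w)

def snc_loss(X, Y, Z, Z_labels, gamma=1.0):
    """SNC objective -sum_i log(p_i) for a fixed compressed set Z."""
    return -sum(math.log(snc_correct_prob(x, y, Z, Z_labels, gamma))
                for x, y in zip(X, Y))

# Toy data (assumption): four training points, two pseudo-inputs (m = 2).
X = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]]
Y = [0, 0, 1, 1]
Z = [[0.1, 0.0], [2.0, 2.0]]
loss_good = snc_loss(X, Y, Z, [0, 1])  # pseudo-input labels match nearby data
loss_bad = snc_loss(X, Y, Z, [1, 0])   # pseudo-input labels swapped
print(loss_good, loss_bad)
```

The loss is small when each training point lies near a pseudo-input of its own class and grows quickly when the nearest pseudo-input carries a different label.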
3 Deep Stochastic Neighbor Compression
Our proposed framework Deep Stochastic Neighbor Compression (DSNC) is based on the idea of summarizing/compressing data in a nonlinear feature space learned via a deep feedforward neural network. Although methods like SNC (Sect. 2.2) can achieve a significant data compression, the inferred pseudo-inputs \(\mathbf {Z}\) still belong to the original feature space, or a linear subspace of the original data. In contrast, DSNC learns \(\mathbf {Z}\) in a more expressive, nonlinear feature space. Note that, in our framework, nonlinear feature learning naturally corresponds to a nonlinear (deep) metric learning.
DSNC consists of a deep feedforward neural network architecture which jointly learns a compressed set \(\mathbf {Z} \in \mathbb {R}^{d\times m}\) with m pseudo-inputs (\(m\ll N\)), along with a deep feature mapping \(f(\cdot | \mathbf {W})\) from the original feature space \(\mathbb {R}^D\) to a transformed space \(\mathbb {R}^d\). The procedure is illustrated in Fig. 1. The set \(\mathbf {Z}\) consisting of the inferred pseudo-inputs and the deep feature representation \(f(\cdot | \mathbf {W})\) are used as a reference set and feature transformation, respectively, at test-time of an instance based method such as \(k\)NN classification. In the following we describe the key components of DSNC.
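To make the test-time use of \(\mathbf {Z}\) and \(f(\cdot | \mathbf {W})\) concrete, here is a minimal sketch of 1-NN prediction against a compressed reference set. `toy_f` is a hypothetical stand-in for the learned deep mapping, and the pseudo-inputs and labels are invented for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def predict_1nn(x, f, Z, Z_labels):
    """Map x with f, then return the label of the closest pseudo-input."""
    fx = f(x)
    j = min(range(len(Z)), key=lambda k: sq_dist(fx, Z[k]))
    return Z_labels[j]

def toy_f(x):
    """Hypothetical stand-in for f(.|W): R^3 -> R^2 with a ReLU nonlinearity."""
    return [max(0.0, x[0] + x[1]), max(0.0, x[2])]

# Assumed compressed set with m = 2 pseudo-inputs in the 2-D feature space.
Z = [[2.0, 0.0], [0.0, 2.0]]
Z_labels = ["a", "b"]

print(predict_1nn([1.0, 1.0, 0.0], toy_f, Z, Z_labels))
print(predict_1nn([0.0, 0.0, 3.0], toy_f, Z, Z_labels))
```

Each query costs one forward pass plus m distance computations in \(\mathbb {R}^d\), rather than N distance computations in \(\mathbb {R}^D\).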
3.1 Deep Stochastic Reference Set
Let \(f(\cdot | \mathbf {W})\!:\!\mathbb {R}^D\!\rightarrow \!\mathbb {R}^d\) be a deep neural network mapping function, with \(\mathbf {W}\) as the set of parameters from all layers of the network (see Note 1). Similar to SNC, we aim to learn a compressed set of pseudo-inputs, \(\mathbf {Z} = [\mathbf {z}_1, \cdots , \mathbf {z}_m]\) with \(\mathbf {z}_j \in \mathbb {R}^{d}\), such that \(\mathbf {Z}\) summarizes the original training set in the deep feature space. To this end, akin to SNC, we define the probability that input \(\mathbf {x}_i\) chooses \(\mathbf {z}_j\) as its nearest reference vector as:
\[p_{ij} = \frac{\exp (-\gamma ^2||f(\mathbf {x}_i) - \mathbf {z}_j||^2)}{\sum _{k=1}^{m}\exp (-\gamma ^2||f(\mathbf {x}_i) - \mathbf {z}_k||^2)}. \tag{1}\]
In the optimization, in addition to learning the parameters of a deep neural network, the compressed set \(\mathbf {Z}\) is also learned from the data. This is done by first sampling m examples at random from \(\mathbf {X}\), denoted \(\mathbf {X^{'}}\), and then initializing \(\mathbf {Z}\) with their deep representation, \(\mathbf {Z} = f(\mathbf {X^{'}})\), while recording their original labels. Note that these labels are fixed throughout training, while \(\mathbf {Z}\) and the parameters \(\mathbf {W}\) of the deep mapping f are learned jointly with the objective defined below.
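The initialization step can be sketched as follows. The text only states that the m examples are sampled at random, so uniform sampling without replacement is an assumption here, as are the function names and the stand-in mapping `toy_f`.

```python
import random

def init_compressed_set(X, Y, f, m, seed=0):
    """Sample m training examples, map them with f, and record their labels."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(X)), m)  # m distinct indices (assumed uniform sampling)
    Z = [f(X[i]) for i in idx]          # Z = f(X'), later updated by gradient steps
    Z_labels = [Y[i] for i in idx]      # labels stay fixed during training
    return Z, Z_labels

def toy_f(x):
    """Hypothetical stand-in for the deep mapping f(.|W): R^2 -> R^2."""
    return [x[0] + x[1], x[0] - x[1]]

# Toy training set (assumption): 10 points with two classes.
X = [[float(i), float(i % 3)] for i in range(10)]
Y = [i % 2 for i in range(10)]

Z, Z_labels = init_compressed_set(X, Y, toy_f, m=4)
print(Z, Z_labels)
```

With purely uniform sampling and a small m, some class may receive no pseudo-input; whether the authors guard against this is not stated in the text.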
3.2 DSNC Objective
To define an objective function for DSNC, we would like to ensure \(p_i\triangleq \sum _{j:y_i=\hat{y_j}}p_{ij} =1\) for all \(\mathbf {x}_i \in \mathbf {X}\), where \(p_{ij}\) is defined in (1). This means that the probability \(p_{ij}\) for an input \(\mathbf {x}_i\) and a pseudo-input \(\mathbf {z}_j\) with different labels is zero. We then define the KL-divergence between the “perfect” distribution “1” and \(p_i\) as
\[\mathrm {KL}(1\,||\,p_i) = -\log (p_i). \tag{2}\]
We wish to find a compressed set \(\mathbf {Z}\) such that as many training inputs as possible are classified correctly in the deep feature space. In other words, we would like \(p_i\) to be close to 1 for all \(\mathbf {x}_i \in \mathbf {X}\). This leads to the following objective:
\[\min _{\mathbf {Z}, \mathbf {W}}\; \tilde{\mathcal {L}}(\mathbf {Z}, \mathbf {W}) = -\sum _{i=1}^{N}\log (p_i), \tag{3}\]
where \(\mathbf {W}\) denotes the parameters of the deep feedforward neural network.
There are two possible issues that may arise while optimizing the objective (3) for DSNC and need to be properly accounted for. First, since we are jointly learning the deep feature map f and the compressed set \(\mathbf {Z}\), without any constraints, it is possible that the mapped samples \(f(\mathbf {x}_i)\) are on a different scale than the compressed samples \(\mathbf {Z}\) in the deep feature space, while achieving a small value for the objective function (3). To handle this issue, we encourage the distance between \(f(\mathbf {x}_i)\) and \(\mathbf {z}_j\) to be small to avoid an inhomogeneous distribution in the feature space.
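The scale issue can be seen in a small numerical sketch. The code below evaluates \(-\sum _i \log (p_i)\) on fixed, already-mapped features for two hand-picked compressed sets; the mean squared distance is used only as a diagnostic of scale mismatch and is not claimed to be the regularizer used in DSNC. All names and values are illustrative assumptions.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def dsnc_loss(F, Y, Z, Z_labels, gamma=1.0):
    """-sum_i log(p_i), with p_ij a softmax over -gamma^2 * ||f(x_i) - z_j||^2."""
    loss = 0.0
    for fx, y in zip(F, Y):
        d = [gamma ** 2 * sq_dist(fx, z) for z in Z]
        d_min = min(d)  # shift for numerical stability; it cancels in the ratio
        w = [math.exp(-(dj - d_min)) for dj in d]
        p_i = sum(wj for wj, lab in zip(w, Z_labels) if lab == y) / sum(w)
        loss -= math.log(p_i)
    return loss

def mean_sq_dist(F, Z):
    """Average squared distance between mapped samples and pseudo-inputs."""
    return sum(sq_dist(fx, z) for fx in F for z in Z) / (len(F) * len(Z))

# Mapped samples f(x_i) (assumed), two classes on either side of the origin.
F = [[0.5, 0.0], [0.4, 0.0], [-0.5, 0.0], [-0.4, 0.0]]
Y = [0, 0, 1, 1]
labels = [0, 1]
Z_near = [[0.5, 0.0], [-0.5, 0.0]]   # same scale as the mapped data
Z_far = [[10.0, 0.0], [-10.0, 0.0]]  # same directions, much larger scale

print(dsnc_loss(F, Y, Z_near, labels), mean_sq_dist(F, Z_near))
print(dsnc_loss(F, Y, Z_far, labels), mean_sq_dist(F, Z_far))
```

In this toy setting the far-away pseudo-inputs give a lower loss than the well-placed ones, even though they sit far from every mapped sample, which is the behavior the distance penalty is meant to discourage.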
Second, it is also possible that all the compressed data samples with the same label collapse into a single point since our objective aims to maximize the classification accuracy. As a result, we also penalize the distribution of the compressed samples to encourage a multi-modal distribution for each label. This is done by maximizing the pair-wise distance between two pseudo-inputs \(\mathbf {z}_i\) and \(\mathbf {z}_j\) with the same label. Consequently, the DSNC objective function combines the KL-divergence term \(\tilde{\mathcal {L}}(\mathbf {Z}, \mathbf {W})\) with two additional regularization terms to account for these, and is given by
where \(\lambda _1\) and \(\lambda _2\) are regularization coefficients and the delta function \(\delta (\hat{y_i},\hat{y_j})\!=\!1\) if \(\hat{y_i}\!=\!\hat{y_j}\), and 0 otherwise. \(\{\hat{y_i}\}\) are the labels for the compressed set \(\mathbf {Z}\). \(R_1\) regularizes the compressed samples to be close to the training data in the deep feature space, while \(R_2\) encourages compressed samples with the same label to dissociate.
3.3 Learning with Stochastic Gradient Descent
The objective function (4) can be easily optimized via the back-propagation algorithm with stochastic gradient descent [7]. We adopt the RMSProp algorithm [35].
Specifically, there are two components that need to be updated: the parameters \(\mathbf {W}\) of the deep neural network, and the compressed set \(\mathbf {Z}\). Parameters \(\mathbf {W}\) are updated by back-propagation, which requires the gradient of the objective with respect to the output \(f(\mathbf {X})\), which is then back-propagated down the neural network. The compressed set \(\mathbf {Z}\) can be simply updated with a stochastic gradient descent step. The stochastic gradients for both \(\mathbf {Z}\) and \(f(\mathbf {X})\) have simple and compact forms. To write down the gradients, we first define the following matrices \(\{\mathbf {Q}, \mathbf {P} , \hat{\mathbf {P}}\}\in \mathbb {R}^{n\times m},\mathbf {Q}_1\in \mathbb {R}^{m\times m}, \mathbf {P}_1 \in \mathbb {R}^{d\times n}\), and \(\{\mathbf {P}_2, \mathbf {Q}_2\}\in \mathbb {R}^{d\times m}\) as
Here, \(p_{ij}\) is defined in (1), \(x_{ij}\) and \(z_{ij}\) denote the corresponding elements of row/column i / j in X / Z. After some careful algebra, the gradient of \(\mathcal {L}\) with respect to the compressed set \(\mathbf {Z}\) and \(f(\mathbf {X})\) can then be conveniently represented in matrix operations with the above defined symbols, i.e.,
where \(\circ \) is the Hadamard (element-wise) product, \(\mathbf {1}_n\) is the \(n\times 1\) vector of all ones, and \(\text{ diag } (\cdot )\) is the diagonal operator placing a vector along the diagonal of an otherwise zero matrix. Given the gradients, learning proceeds by applying the RMSProp algorithm to \(\mathbf {Z}\) and back-propagation to learn \(\mathbf {W}\), as described in Algorithm 1.
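For readers unfamiliar with RMSProp, a generic update step on a matrix such as \(\mathbf {Z}\) can be sketched as below. The learning rate, decay and epsilon values are assumptions, and the gradient comes from a toy quadratic objective rather than from the DSNC objective (4).

```python
import math

def rmsprop_step(Z, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One in-place RMSProp update on a matrix stored as nested lists."""
    for r in range(len(Z)):
        for c in range(len(Z[r])):
            g = grad[r][c]
            # Running average of squared gradients, kept per entry.
            cache[r][c] = decay * cache[r][c] + (1.0 - decay) * g * g
            # Step scaled by the root of the running average.
            Z[r][c] -= lr * g / (math.sqrt(cache[r][c]) + eps)
    return Z, cache

# Toy stand-in objective (assumption): 0.5 * ||Z - T||^2, with gradient Z - T.
T = [[1.0, -1.0], [0.5, 2.0]]
Z = [[0.0, 0.0], [0.0, 0.0]]
cache = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(2000):
    grad = [[Z[r][c] - T[r][c] for c in range(2)] for r in range(2)]
    rmsprop_step(Z, grad, cache)
print(Z)
```

In DSNC the same kind of step would be driven by the stochastic gradient of the objective with respect to \(\mathbf {Z}\) on each mini-batch, while \(\mathbf {W}\) receives its gradient through back-propagation.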
3.4 Relationship with Deep Neural Net with Softmax Output
We now show how DSNC is related to a deep neural network with a softmax output. Note that they are comparable only when \(m=|\mathcal {Y}|\), i.e., the number of pseudo-inputs is equal to the number of classes. For a deep neural network with a softmax output, the probability corresponding to (1) can be written as
\[p_{ij} = \frac{\exp (\mathbf {z}_j^{\top } f(\mathbf {x}_i))}{\sum _{k=1}^{m}\exp (\mathbf {z}_k^{\top } f(\mathbf {x}_i))}.\]
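Note that the Euclidean distance in DSNC is replaced by an inner product in the softmax function above. When \(\gamma ^2 = \frac{1}{2}\) and \(||f(\mathbf {x}_i)||_2^2 = ||\mathbf {z}_j||^2_2=1\), we have \(-\frac{1}{2}||f(\mathbf {x}_i) - \mathbf {z}_j||_2^2 = \mathbf {z}_j^{\top } f(\mathbf {x}_i) - 1\), and the constant cancels between numerator and denominator. The probability that \(\mathbf {x}_i\) belongs to “class” \(\mathbf {z}_j\), as given by (1), can therefore be written as
\[p_{ij} = \frac{\exp (\mathbf {z}_j^{\top } f(\mathbf {x}_i))}{\sum _{k=1}^{m}\exp (\mathbf {z}_k^{\top } f(\mathbf {x}_i))},\]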
which exactly recovers the softmax output. Therefore, a deep neural network with a softmax output can be viewed as a special case of our DSNC framework.
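This special case can be checked numerically. The sketch below compares the distance-based probabilities with \(\gamma ^2 = \frac{1}{2}\) against inner-product softmax probabilities on unit-norm vectors; the vectors themselves are arbitrary values chosen for illustration.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(u * u for u in v))
    return [u / n for u in v]

def distance_probs(fx, Z, gamma_sq):
    """p_ij from a softmax over -gamma^2 * ||f(x_i) - z_j||^2."""
    w = [math.exp(-gamma_sq * sum((a - b) ** 2 for a, b in zip(fx, z))) for z in Z]
    s = sum(w)
    return [wj / s for wj in w]

def softmax_probs(fx, Z):
    """p_ij from a softmax over inner products z_j^T f(x_i)."""
    w = [math.exp(sum(a * b for a, b in zip(fx, z))) for z in Z]
    s = sum(w)
    return [wj / s for wj in w]

# Arbitrary vectors (assumption), normalized so that all norms equal 1.
fx = normalize([0.3, -1.2, 0.5])
Z = [normalize(z) for z in ([1.0, 0.0, 0.0], [0.2, -0.9, 0.4], [-0.5, 0.5, 0.7])]

p_dist = distance_probs(fx, Z, gamma_sq=0.5)
p_soft = softmax_probs(fx, Z)
print(max(abs(a - b) for a, b in zip(p_dist, p_soft)))
```

The two probability vectors agree up to floating-point error; without the unit-norm constraint they generally differ.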
4 Related Work
Our work is aimed at improving the accuracies of instance based methods, such as kNN, by learning highly discriminative feature representations (equivalently, learning a good distance metric), while also speeding up the test-time predictions. It is therefore related to both feature/distance-metric learning algorithms, as well as data summarization/compression algorithms for instance based methods.
In the specific context of kNN methods, there have been several previous efforts to speed up kNN’s test-time predictions. The vast majority of these methods reduce to speeding up the retrieval of k nearest neighbors without modifying the training set. These include space partition algorithms such as ball-trees [6, 30] and kd-trees [5], as well as approximate neighbor search like locality-sensitive hashing [3, 14]. Our paper addresses the problem from the perspective of data compression that reduces the size of the training set. Note that the data compression approach is orthogonal to prior efforts on fast retrieval, and thus these two methodologies could be combined.
Perhaps the most straightforward idea for data compression is subsampling the dataset. The seminal work in this area is Condensed Nearest Neighbors (CNN) proposed by [18]. It starts off with two sets, S and T, where S contains an instance of the training set and T contains the rest. CNN repeatedly scans T, looking for an instance in T that is misclassified using the data in S. This instance is then moved from T to S. This process continues until no more data movement can be made. Since this work, there have been several variants of CNN, including MCNN to address the order-dependence issue of CNN [11], a post-processing method [13], and fast CNN (FCNN) [4]. With these methods, the compressed training set is always a subset of the original training set, which is not necessarily a good representation. Recently, [24] introduced Stochastic Neighbor Compression (SNC), which learns a synthetic set as the compressed set. Treating the synthetic set as the design variables, SNC uses a stochastic neighborhood [16, 20, 26] to model the probability of each training instance being correctly classified by the synthetic set. The synthetic set is obtained through numerical optimization, where the objective is to minimize the KL-divergence between the modeled distribution and the “perfect” distribution in which all training instances are correctly classified.
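As a minimal sketch of the Condensed Nearest Neighbors procedure described above, the code below moves misclassified instances from T to S until a full scan makes no change. Seeding S with the first training instance and the toy one-dimensional data are assumptions; [18] should be consulted for the original formulation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_label(x, S):
    """Label of the 1-nearest neighbor of x among the (point, label) pairs in S."""
    return min(S, key=lambda item: sq_dist(x, item[0]))[1]

def condensed_nn(X, Y):
    """Move instances misclassified by 1-NN on S from T to S until T is stable."""
    S = [(X[0], Y[0])]         # S starts with one training instance (assumed: the first)
    T = list(zip(X[1:], Y[1:]))
    moved = True
    while moved:
        moved = False
        remaining = []
        for x, y in T:
            if nearest_label(x, S) != y:
                S.append((x, y))   # misclassified using S: move it into S
                moved = True
            else:
                remaining.append((x, y))
        T = remaining
    return S

# Toy data (assumption): two well-separated one-dimensional classes.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
Y = [0, 0, 0, 1, 1, 1]
S = condensed_nn(X, Y)
print(S)
```

The result is always a subset of the training set and depends on the scan order, which are the two limitations noted in the text.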
Other works on summarizing/compressing massive data sets for machine learning problems include methods such as coresets [1] for geometric problems (e.g., k-means/k-median clustering, nearest neighbor methods, etc.). Kernel methods are also known to have the problem of having to store the entire training data in memory and being slow at test time, and several methods such as landmark-based approximations [21, 40] have been proposed to address these issues. However, all these methods can only perform data compression by learning a set of representatives in the original feature space, and are not suited for data sets that are high-dimensional and exhibit significant nonlinearities.
All of the above methods operate on the original data space, not embracing the superior expressive power of deep learning. With unprecedented generalization performance, deep learning has achieved great success in various important applications, including speech recognition [17, 19, 29], natural language processing [9, 27, 28], image labeling [12, 23, 34, 39], and object detection [15, 36]. Recently, the kNN classifier has been equipped with deep learning in modern face recognition systems, such as FaceNet [33]. In particular, kNN performs classification on the space mapped by a convolutional net [25]. However, FaceNet trains the convolutional net to reflect the actual similarity between images/faces, rather than the classification accuracy of kNN. [32] introduced a method to train a deep neural net so that kNN performs well on the transformed space. Though inspired by this work, our DSNC is fundamentally different in that it not only optimizes the kNN performance but also simultaneously learns a compressed set in a new nonlinear feature space learned by a feedforward deep neural network.
5 Results
We present experimental results on seven benchmark datasets, including four from [24], i.e., mnist, yaleface, isolet, adult; and three additional, more complex datasets, i.e., 20news, cifar10 and cifar100. Some statistics are listed in Table 1. Since yaleface has no predefined test set, we report the average performance over 10 splits. All other results are reported on predefined test sets. We begin by describing the experimental settings, and then evaluate the test errors, compression ratios, feature representations, sensitivity to hyper-parameters and visualization of distributions of the test sets in the deep feature space. Our code is publicly available at http://people.duke.edu/~ww107/.
5.1 Experimental Setting
To explore the advantages of our deep-learning-based method, we use raw features as the input for DSNC and the corresponding reference deep neural networks (see Note 2). For mnist, yaleface (rescaled to \(48\times 42\) pixels [37]), cifar10 and cifar100, we adopt convolutional neural networks, while isolet, adult and 20news are fitted with feed-forward neural networks. ReLU is adopted as the activation function after hidden layers for all models. Details of the network structures are shown in Table 2. When comparing the error with varying compression ratios in Sect. 5.3, we fix d in Hd to be \(d_{L}\) in Table 1, and when comparing the error with varying dimensions in Sect. 5.4, we keep the compressed size m at \(N_{max}\).
DSNC is implemented using Torch7 [10] and trained on NVIDIA GTX TITAN graphics cards with 2688 cores and 6 GB of global memory. We verify the implementation by numerical gradient checking, and optimize using stochastic gradient descent with RMSProp, using mini-batches of size 100. For all the datasets, we randomly select \(20\,\%\) of the training data for cross-validation of hyper-parameters \(\lambda _1\) and \(\lambda _2\) and early stopping. In contrast to SNC, our DSNC is not sensitive to \(\gamma \). Thus we use a constant value of 1 for all DSNC experiments.
With SNC we follow a similar setup to [24]. For isolet and mnist, the dimensionality is reduced with LMNN as described in [38]. For yaleface, we follow [38] and first rescale the images to \(48\times 42\) pixels, then reduce the dimensionality with PCA, while omitting the leading five principal components, which largely account for lighting variations. Finally we apply large margin nearest neighbor (LMNN) to reduce the dimensionality further to \(d=100\). For cifar10 and cifar100, we use LMNN to reduce the dimensionality to \(d=200\). In fact, the dimensionality of SNC is determined by LMNN. The parameters used for comparing the test error with varying compression rates and dimensionality are exactly the same as for DSNC, as described above. Parameters are listed in Table 1. Notice that LMNN is used as the pre-processing step for all the methods except our DSNC and the corresponding reference networks.
5.2 Baselines
We experiment with two versions of DSNC: one uses the compressed data \(\mathbf Z \) as the kNN reference during testing, denoted by Compression; the other uses the entire training data, denoted by ALL. We compare DSNC against the following related baselines, where the 1-nearest neighbor rule is adopted for all kNN methods.
- kNN without compression, with/without dimensionality reduction with LMNN;
- kNN using Stochastic Neighbor Compression (SNC) [24];
- Approximate kNN with Locality-Sensitive Hashing (LSH) [2, 14];
- Deep neural network classifier with the same network structure as DSNC.
5.3 Errors with Varying Compression Ratios
In this section we experiment with varying compression ratios of the dataset, where the compression ratio is defined as the ratio between the compressed data size and the whole data size. The results are plotted in Fig. 2. Several conclusions can be drawn from the results: (1) DSNC outperforms other methods on all data sets. The gap between DSNC and SNC is large for all the data sets, which indicates the advantage of learning the nonlinear feature space for data compression. (2) DSNC emerges as a stable compression method that is robust to the compression ratio. This is especially true when the compression ratio is small. For example, when the compressed data size equals the number of classes (\(m = |\mathcal {Y}|\)), DSNC still performs well, yielding significantly lower errors than LSH, CNN, FCNN and SNC. Generally, with increasing m, test errors tend to decrease to a certain degree. (3) Compared with reference deep neural networks with softmax outputs, DSNC exhibits better performance on most datasets except adult, but with smaller gaps than to the other methods. A possible reason could be that the task is binary classification and multi-modality within class distributions may not be that pronounced in the dataset. Notably, when \(m=|\mathcal {Y}|\), DSNC reduces to the reference neural network with Euclidean distance as the metric in the softmax. We can see that on 20news, cifar10 and cifar100, the reference neural networks perform better in this setting. However, with an adaptive m, DSNC can always surpass the reference neural networks. The observation on YaleFace is particularly surprising, as there is a big performance gap between DSNC and the corresponding convolutional neural networks. This supports our motivation of learning a representative feature space for data compression, as DSNC has more degrees of freedom to adapt the compressed data to a weak feature representation.
5.4 Errors with Varying Feature Dimensions
Next we investigate the impact of feature dimensions on the classification accuracy. To test the adaptive ability of DSNC to extremely low-dimensional feature spaces, we vary the feature space dimensions from 10 to 300 on CIFAR10 and CIFAR100, and from 10 to 100 on the other datasets. The results are plotted in Fig. 3. We can see from the figure that the performance does not deteriorate when learning with a deep nonlinear transformation, i.e., DSNC and DNN/CNN yield almost the same test errors with different feature dimensions on all the datasets, while other methods produce significantly worse performance when the feature dimension is low. In particular, for MNIST and Isolet, a 20-dimensional space is found to be powerful enough to express the dataset, while for CIFAR100, a nonlinear transformation into a 100-dimensional space obtains an accuracy that is close to the optimal performance. Interestingly, we also notice that using the compressed data outperforms using the entire mapped data. This is because our objective optimizes directly on the compressed data set, which can effectively filter out the noise in the original data set consisting of all the observations.
5.5 Sensitivity to Hyper-parameters
In contrast to SNC, it is found that our model is not sensitive to the parameter \(\gamma \) in the stochastic neighborhood term. However, the hyper-parameters \(\lambda _1\) and \(\lambda _2\) do influence the performance of DSNC, because they control different behaviors of the objective. Specifically, \(\lambda _1\) tends to pull the compressed data closer to the training sets in the deep feature space, while \(\lambda _2\) pushes the compressed data with the same label to be far away from each other, such that they do not collapse into a single point and tend to capture the within-class multi-modality. We visualize the effects of these hyper-parameters by embedding the compressed data into 2-dimensional space using tSNE [26]. We use the 20news dataset for visualization in Fig. 4. Consistent with our intuition, we find that with increasing \(\lambda _1\), the compressed data tends to be condensed, and far away from the training data; while increasing \(\lambda _2\) generally pushes the compressed data in the same class to be more scattered. This indicates that if we want a larger compressed set of pseudo-inputs (i.e., m is large), a larger value of \(\lambda _2\) should be set. The accuracies with different values of \(\lambda _1\) and \(\lambda _2\) are summarized in Table 3, which indicates that suitable choices for \(\lambda _1\) and \(\lambda _2\) are essential for good performance.
5.6 Comparison of DSNC with SNC and SOFTMAX
In order to further understand the advantage of DSNC over SNC and softmax-based deep neural networks (SOFTMAX), we visualize them on MNIST. We adopt the same models as in the above experiments with a reference set consisting of \(m=100\) pseudo-inputs. This gives us cleaner results in the visualization. The inferred pseudo-inputs in the feature space are plotted in Fig. 5. It can be clearly seen that DSNC is able to learn both a separable feature space and representative data, whereas for SNC, the compressed data does not seem to be separable. As for SOFTMAX, even though it can learn centered clusters, its tendency to only learn unimodal within-class distributions leads to poor performance around the decision boundary.
6 Conclusion
We propose DSNC to jointly learn a deep feature space and a set of compressed data that best represents the whole data. The algorithm consists of a deep neural network component for feature learning, on top of which an objective is proposed to optimize the kNN criterion, leading to a natural extension of the popular softmax-based deep neural networks. We test DSNC on a number of benchmark datasets, obtaining significantly improved performance compared to existing data compression algorithms.
Notes
1. For conciseness, we will typically omit the parameters \(\mathbf {W}\) from the notation for the mapping function, i.e., \(f(\cdot ) \triangleq f(\cdot | \mathbf {W})\).
2. The same network structure as DSNC except with a softmax-output.
References
1. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. Comb. Comput. Geom. 52, 1–30 (2005)
2. Aly, M., Munich, M., Perona, P.: Indexing in large scale image collections: scaling properties and benchmark. In: 2011 IEEE Workshop on Applications of Computer Vision (WACV), pp. 418–425. IEEE (2011)
3. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 459–468. IEEE (2006)
4. Angiulli, F.: Fast condensed nearest neighbor rule. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 25–32. ACM (2005)
5. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
6. Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104. ACM (2006)
7. Bottou, L.: Stochastic gradient descent tricks. Tech. rep., Microsoft Research, Redmond, WA (2012)
8. Cayton, L.: Fast nearest neighbor retrieval for Bregman divergences. In: Proceedings of the 25th International Conference on Machine Learning, pp. 112–119. ACM (2008)
9. Chen, W., Grangier, D., Auli, M.: Strategies for training large vocabulary neural language models (2015). arXiv preprint arXiv:1512.04906
10. Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: a matlab-like environment for machine learning. In: BigLearn, NIPS Workshop. No. EPFL-CONF-192376 (2011)
11. Devi, V.S., Murty, M.N.: An incremental prototype set building technique. Pattern Recogn. 35(2), 505–513 (2002)
12. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: a deep convolutional activation feature for generic visual recognition (2013). arXiv preprint arXiv:1310.1531
13. Gates, G.: The reduced nearest neighbor rule. IEEE Trans. Inf. Theory 18(3), 431–433 (1972)
14. Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. VLDB 99, 518–529 (1999)
15. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
16. Goldberger, J., Hinton, G.E., Roweis, S.T., Salakhutdinov, R.: Neighbourhood components analysis. In: Advances in Neural Information Processing Systems, pp. 513–520 (2004)
17. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: ICASSP (2013)
18. Hart, P.E.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14, 515–516 (1968)
19. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Magaz. 29(6), 82–97 (2012)
20. Hinton, G.E., Roweis, S.T.: Stochastic neighbor embedding. In: Advances in Neural Information Processing Systems, pp. 833–840 (2002)
21. Hsieh, C.J., Si, S., Dhillon, I.S.: Fast prediction for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 3689–3697 (2014)
Hu, J., Lu, J., Tan, Y.P.: Discriminative deep metric learning for face verification in the wild. In: CVPR (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
Kusner, M., Tyree, S., Weinberger, K.Q., Agrawal, K.: Stochastic neighbor compression. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 622–630 (2014)
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Chiba, Japan, 26–30 September 2010, pp. 1045–1048 (2010)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
Mohamed, A.R., Sainath, T.N., Dahl, G., Ramabhadran, B., Hinton, G.E., Picheny, M.A.: Deep belief networks using discriminative features for phone recognition. In: ICASSP (2011)
Omohundro, S.M.: Five balltree construction algorithms. International Computer Science Institute Berkeley (1989)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Tech. rep., DTIC Document (1985)
Salakhutdinov, R., Hinton, G.E.: Learning a nonlinear embedding by preserving class neighbourhood structure. In: International Conference on Artificial Intelligence and Statistics, pp. 412–419 (2007)
Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: a unified embedding for face recognition and clustering (2015). arXiv preprint arXiv:1503.03832
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks (2013). arXiv preprint arXiv:1312.6229
Tieleman, T., Hinton, G.E.: Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. Tech. rep., Coursera: Neural Networks for Machine Learning (2012)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator (2014). arXiv preprint arXiv:1411.4555
Weinberger, K., Dasgupta, A., Langford, J., Smola, A., Attenberg, J.: Feature hashing for large scale multitask learning. In: ICML (2009)
Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. In: Advances in neural information processing systems, pp. 1473–1480 (2005)
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part I. LNCS, vol. 8689, pp. 818–833. Springer, Heidelberg (2014)
Zhang, K., Tsang, I.W., Kwok, J.T.: Improved nyström low-rank approximation and error analysis. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1232–1239. ACM (2008)
Acknowledgments
This research was supported in part by ARO, DARPA, DOE, NGA and ONR.
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Wang, W., Chen, C., Chen, W., Rai, P., Carin, L. (2016). Deep Metric Learning with Data Summarization. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9851. Springer, Cham. https://doi.org/10.1007/978-3-319-46128-1_49
DOI: https://doi.org/10.1007/978-3-319-46128-1_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46127-4
Online ISBN: 978-3-319-46128-1
eBook Packages: Computer Science, Computer Science (R0)