Toward efficient indexing structure for scalable content-based music retrieval
With the advancement of information processing and storage techniques, digital music collections have been growing at a very fast pace in recent decades. To support high-quality content-based retrieval over such large volumes of music data, developing indexing structures with good effectiveness, efficiency and scalability has become an important research issue. However, existing techniques mainly focus on improving query efficiency; very few approaches address scalability and accuracy. In this study, we tackle the problem by introducing a novel indexing technique called the effective music indexing framework (EMIF) to facilitate scalable and accurate music retrieval. It is designed based on a "classify-and-indexing" principle and consists of two main functional modules: (1) a music classification module, a novel semantic-sensitive classifier that identifies an input song's category, and (2) an indexing module, a set of local indexing structures, one per semantic category, that reduces query response time significantly. In particular, a classification model combining a linear discriminative mixture model (LDMM) with an advanced score fusion scheme is applied to estimate the category of a piece of music accurately. The layered architecture gives EMIF superior scalability and efficiency. To evaluate the approach, a set of experimental studies has been carried out on three large music test collections, and the results demonstrate the advantages of EMIF over state-of-the-art approaches in efficiency, scalability and effectiveness.
Keywords: Multimodal indexing · Content-based music retrieval · Efficiency · Scalability
Recent years have witnessed fast growth of digital multimedia data across application domains (e.g., online streaming services, education and entertainment) [1, 2, 3, 4, 5, 6, 7]. To achieve fast and reliable access to such a large volume of multimedia data, efficiency becomes an important issue, and an intelligent indexing structure is essential to scale over the data space. In particular, advances in networking, cloud storage and mobile devices have fueled the growth of music data in many formats. For example, according to a Nielsen market report, on-demand song streaming volume grew 45% and exceeded 268 billion streams in 2018. In response to the need for tools that access such large music collections quickly, various indexing methods have been proposed in recent decades to support efficient content-based music information retrieval (CBMIR) and analysis [8, 9, 10, 11, 12, 13, 14]. Specific examples include the CM*F , the QUC-tree , LSH-based approaches [17, 18, 19] and so on. Most of them are designed around the principle of "feature transformation", which has emerged as an important search paradigm. The basic idea is to extract low-level acoustic features (usually in the form of a multidimensional feature vector) from each music document in the database and map them into points in a high-dimensional feature space as signatures. The distance between two feature points then serves as a measure of the similarity between the two audio files. Once a distance or similarity function is defined over the feature space, nearest neighbor search can retrieve the objects that satisfy the criteria specified in a given query.
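As a minimal illustration of this feature-transformation paradigm, the sketch below builds a toy database of random feature vectors (hypothetical stand-ins for real acoustic signatures) and runs an exact nearest-neighbor search under Euclidean distance:

```python
import numpy as np

# Hypothetical database of acoustic signatures: one 32-D feature vector per
# song (a real system would extract MFCC-like descriptors instead).
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 32))

def nearest_neighbors(query, db, k=5):
    """Return the indices of the k songs whose feature points lie closest
    to the query under Euclidean distance (an exact linear scan)."""
    dists = np.linalg.norm(db - query, axis=1)
    return np.argsort(dists)[:k]

# A query example: song 42 slightly perturbed, mimicking a near-duplicate clip.
query = database[42] + 0.01 * rng.normal(size=32)
top = nearest_neighbors(query, database, k=3)
print(top)
```

The linear scan above is exactly what an indexing structure tries to avoid: its cost grows with the database size, which motivates the tree- and hash-based methods surveyed below.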
While existing approaches are efficient in some specialized music IR and database applications [19, 20], many open problems remain. First, good scalability and low maintenance cost are essential to modern music information retrieval systems, whose contents can easily be huge and updated frequently. Notice that the rebuilding cost of existing indexing structures is directly related to the data size; relatively little attention has been paid to improving performance in this direction, and the associated update operations are generally computationally expensive. Further, the query-processing efficiency (e.g., response time or system reconstruction time) of existing approaches can degrade dramatically as music collections grow. Moreover, recently proposed indexing structures (e.g., the M-tree, Hybrid tree, \(\varDelta\)-tree, QUC-tree and LSH) focus primarily on query efficiency but generally ignore the quality of the retrieval results. In fact, due to the well-known "semantic gap", accurate query processing cannot be achieved with an indexing structure built on low-level features alone . To develop comprehensive music content descriptors for accurate similarity retrieval, we need to combine low-level features into a more effective music signature. This introduces two correlated sub-problems: (1) how the various low-level features should be fused for a particular search task and (2) how the combined feature can be made compact enough to enable fast search and classification with existing indexing algorithms or machine learning methods. Raw acoustic feature vectors naturally have high dimensionality (some have up to 100 dimensions), and creating a generalized high-dimensional index that can handle hundreds of dimensions remains an unsolved problem to date .
This is because many existing indexing methods have exponential time and space complexity in the number of dimensions; when indexing high-dimensional vectors, they perform no better than a sequential scan of the database. Moreover, existing studies generally ignore the scalability of the indexing structure, which is crucial for retrieving and managing large-scale music databases: such systems can contain thousands of audio files, the contents of the collections may change frequently, and the associated maintenance cost can be extremely high. Motivated by these concerns, several dimension reduction methods have been proposed to generate smaller content representations that improve efficiency and effectiveness. However, they still suffer from poor scalability (expensive update cost) and/or low effectiveness (less comprehensive content representation) .
1.2 Core technical contributions
We develop a multiple-feature-based music class profiling model to characterize different music categories. Functionally, it is a probabilistic classifier that estimates the correct label of input music. The scheme effectively combines multiple features to enhance categorization accuracy and thus greatly improve overall retrieval accuracy.
Distinguished from previous approaches, EMIF's architecture is designed based on a "classify-and-indexing" principle and applies a multilayer structure consisting of two basic components: a classification module and an indexing module. This design enables superior scalability and efficiency and significantly reduces the system reconstruction cost, which is a major overhead for existing solutions.
We develop a novel deep learning-based music signature generation scheme called DMSG to compute a compact and comprehensive music descriptor, the deep music signature (DMS). The approach effectively combines various kinds of acoustic features into a small feature vector that enhances indexing and retrieval with existing access methods.
We conduct a set of detailed experimental studies and result analyses on three large test collections, demonstrating that EMIF enjoys superior scalability, effectiveness and efficiency over existing approaches.
2 Related work
In this section, we focus on previous work and background knowledge related to CBMIR. Section 2.1 surveys existing multidimensional indexing structures; Sect. 2.2 briefly reviews prior work on modeling music signals and generating music content descriptors.
2.1 Multidimensional indexing structure
The first relevant stream of literature concerns high-dimensional access methods (e.g., indexing trees and dimension reduction). To support fast similarity search in high-dimensional databases, various schemes have been proposed in recent decades . Typical examples include the M-tree , the VA-file , the Hybrid tree , the iDistance  and hashing [17, 29, 30, 31, 32, 33]. In , the authors proposed the height-balanced M-tree to organize and search large datasets from a generic metric space, where object proximity is defined by a distance function satisfying the positivity, symmetry and triangle inequality postulates. The strength of the M-tree lies in maintaining pre-computed distances in the index structure; however, it still suffers from the dimensionality curse. To mitigate this problem, representing the data points with smaller, approximate signatures has also been proposed in recent years; typical examples under this paradigm include the VA-file  and the IQ-tree . The basic idea of the VA-file is to divide the data space into \(2^b\) rectangular cells, where b denotes a user-specified number of bits. The scheme allocates a unique bit-string of length b to each cell and approximates the data points that fall into a cell by that bit-string; the VA-file itself is simply an array of these approximations. A KNN search scans the entire approximation file and, from the rectangular cell each approximation represents, derives upper and lower bounds on the distance to the query, which allows the vast majority of vectors to be excluded (the filtering step). The small set of surviving candidates is then visited and the actual distances to the query point Q are computed. The VA-file has been shown to perform well for disk-based systems as it reduces the number of random I/Os.
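The filtering step can be illustrated with a toy sketch (hypothetical data, not the original VA-file implementation): each dimension is quantized into \(2^b\) cells, and the cell geometry yields a distance lower bound that never overestimates the true distance.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(size=(500, 4))            # toy 4-D points in [0, 1)
b = 3                                        # bits per dimension
edges = np.linspace(0.0, 1.0, 2**b + 1)      # cell boundaries per axis

# The approximation file: one small cell index per dimension per point.
approx = np.clip(np.searchsorted(edges, data, side='right') - 1, 0, 2**b - 1)

def lower_bound(query, cells):
    """Lower bound on the distance from `query` to any point whose
    approximation is `cells`, derived from the rectangular cell geometry."""
    lo, hi = edges[cells], edges[cells + 1]
    # Per-dimension gap from the query to the nearest face of each cell.
    gap = np.maximum(np.maximum(lo - query, query - hi), 0.0)
    return np.linalg.norm(gap, axis=1)

query = rng.uniform(size=4)
lb = lower_bound(query, approx)              # filtering step: cheap bounds
true = np.linalg.norm(data - query, axis=1)  # refinement: exact distances
print((lb <= true).mean())                   # the bound never overestimates
```

Points whose lower bound exceeds the current k-th best distance can be skipped without ever reading their full vectors, which is the source of the VA-file's I/O savings.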
The IQ-tree was proposed based on the concept of quantization . The compressed index has a three-level structure: the first level is a regular (flat) directory of minimum bounding rectangles (MBRs), the second level contains the data points in a compressed representation, and the third level contains the actual data. The compressed MBRs reduce disk I/O during search processing. One-dimensional transformation provides another direction for high-dimensional indexing. The iDistance  was presented as an efficient method for KNN search in a multidimensional space. iDistance partitions the data and selects a reference point for each partition; the data points in each cluster are then transformed into a single-dimensional space according to their distance to that reference point. Since this distance is a simple scalar, and a small mapping effort keeps the partitions distinct, a standard B\(^+\)-tree can index the data and KNN search can be performed via one-dimensional range searches. More recently, LSH has attracted much research attention as an indexing structure for fast approximate query processing. The first LSH-based music search system was developed by Yan ; it applies LSH to speed up nearest neighbor search, with the short-time Fourier transform (STFT) as the acoustic feature. Yu et al. developed a dual-phase LSH-based algorithm to improve the accuracy and scalability of content-based music information retrieval systems . More recently, McFee and Lanckriet applied variants of the classical KD-tree to support content-based similarity search over the Million Song Dataset .
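The iDistance mapping can be sketched as follows; the partitioning here is a crude equal-slice stand-in for real clustering, and the constant C is an assumed spacing that keeps partition key ranges disjoint.

```python
import numpy as np

rng = np.random.default_rng(2)
points = rng.normal(size=(300, 8))

# Hypothetical setup: 3 partitions with one reference point each (centroids
# of equal slices stand in for real cluster centers here).
parts = np.array_split(np.arange(len(points)), 3)
refs = np.stack([points[idx].mean(axis=0) for idx in parts])

C = 100.0  # assumed constant keeping the 1-D key ranges of partitions disjoint

def idistance_key(x, pid):
    """Map a point to a scalar key: partition offset + distance to its ref."""
    return pid * C + np.linalg.norm(x - refs[pid])

keys = np.concatenate([
    [idistance_key(points[i], pid) for i in idx]
    for pid, idx in enumerate(parts)
])
# These scalar keys could now be stored in a standard B+-tree; a KNN query
# becomes a set of 1-D range searches around key(q) in candidate partitions.
print(keys.min(), keys.max())
```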
Due to the difficulty of indexing a very high-dimensional data space, a reasonable approach is to reduce the dimensionality to a manageable level (e.g., 10–12 dimensions) and then use an existing high-dimensional indexing scheme (e.g., the M-tree or R-tree) as the access method. Over the past two decades, there has been much research on dimension reduction. The techniques fall into two categories: linear dimension reduction (LDR) and nonlinear dimension reduction (NLDR). The basic idea of LDR is to apply linear statistical analysis to map the original high-dimensional features to low-dimensional ones by eliminating redundant information from the original feature space; the best-known approaches are principal component analysis (PCA) and linear discriminant analysis (LDA). NLDR builds on nonlinear statistical analysis and machine learning algorithms, which have been widely explored by various research communities in recent years. However, NLDR requires high-quality training examples, and its training can be computationally inefficient.
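As a concrete example of LDR, the following sketch performs PCA via the SVD of centered data, reducing 50-dimensional toy features to 10 dimensions that a tree index could handle; the data and target dimension are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))          # 200 songs, 50-D raw acoustic features

def pca_reduce(X, d):
    """Project centered data onto its top-d principal directions."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

Z = pca_reduce(X, 10)                   # 50-D -> 10-D, indexable by e.g. M-tree
print(Z.shape)                          # (200, 10)
```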
In recent years, advanced hashing has played an increasingly important role in supporting fast and effective multimedia information retrieval [30, 31, 32], and steady progress has been observed in this area.
2.2 Music signature generation
The second stream of previous research concerns how to model music content and develop effective schemes for generating comprehensive music signatures. Various kinds of music features can be applied to categorize and index music collections, including text, acoustic features and symbolic signatures of the melody. Here, our primary focus is on content-based acoustic features.
Summary of symbols and definitions:

- Music class c
- Feature type f
- Loss function of the deep learning framework
- Number of classes in the database
- Number of blocks for music segmentation
- Number of acoustic features extracted
- Number of training examples for the logistic fusion function
- Score combination function
- Deep music signature
- Parameter set for the GMM
- Feature vector extracted from block b for feature type f
- Set of feature vectors extracted from different blocks for feature type f
- Final score generated by the logistic combination function for class c
- Likelihood value generated by category c's profile model using feature type f
- Fusion weight vector of the logistic fusion function for class c
- Squared error loss
- Feature extraction scheme for feature type f
3 System architecture
This section gives a detailed introduction to the overall system architecture of EMIF, its two basic components and the related algorithms. As illustrated in Fig. 1, EMIF consists of two major functional layers: a music classification module and an indexing module. The first layer categorizes input music accurately; the indexing module contains a group of deep-learning music signature generation schemes and indexing trees, one per category. To search for similar music given an input example, the different features are extracted first and the music category c is identified; the top k songs are then returned by searching the local indexing tree for category c.
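The two-layer query flow can be sketched as follows. The nearest-centroid classifier and per-category flat scans below are toy stand-ins for EMIF's LDMM classifier and local indexing trees; only the routing idea is illustrated.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-ins for EMIF's two layers: a classifier that picks a
# category, and one small per-category index searched instead of the whole DB.
n_classes, per_class, dim = 4, 250, 16
indexes = {c: rng.normal(loc=c * 5.0, size=(per_class, dim))
           for c in range(n_classes)}
centroids = {c: idx.mean(axis=0) for c, idx in indexes.items()}

def classify(q):
    """First layer: nearest-centroid category decision (toy classifier)."""
    return min(centroids, key=lambda c: np.linalg.norm(q - centroids[c]))

def search(q, k=5):
    """Second layer: scan only the local index of the predicted category."""
    c = classify(q)
    local = indexes[c]
    return c, np.argsort(np.linalg.norm(local - q, axis=1))[:k]

# Query: a slightly perturbed song from category 2.
q = indexes[2][0] + 0.1 * rng.normal(size=dim)
c, hits = search(q)
print(c, hits[:3])
```

Because each query touches only one local index of `per_class` items rather than all `n_classes * per_class`, the search space shrinks by roughly the number of categories, which is the efficiency argument made for the layered design.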
3.1 Multifeature-based music category modeling
3.1.1 Feature extraction
Timbre feature Timbral texture is a global statistical music property used to differentiate mixtures of sounds. It has been widely applied in speech recognition and audio classification. The 33-dimensional timbre vector comprises the means and variances of the spectral centroid, spectral flux, time-domain zero crossings and 13 MFCC coefficients (32 dimensions), plus low energy (1 dimension).
Rhythm feature Rhythmic content captures the repetition of the musical signal over time and can be represented as beat strength and temporal patterns. The beat histogram (BH) proposed by Tzanetakis et al.  is used to describe rhythmic content. The 18-dimensional rhythm vector includes the relative amplitudes of the first six histogram peaks (divided by the sum of amplitudes), the ratios of the amplitudes of five histogram peaks (second to sixth) to the amplitude of the first, the periods of the first six histogram peaks, and the overall sum of the histogram.
Pitch feature Pitch is an important acoustic feature used to characterize the melody and harmony of a music file. It can be extracted via multi-pitch detection techniques . The 18-dimensional pitch vector includes the amplitudes and periods of the six highest peaks in the histogram, the pitch intervals between the six most prominent peaks, and the overall sums of the histograms.
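Two of the timbral descriptors above (zero-crossing rate and spectral centroid) can be computed from signal frames as sketched below; the synthetic tone, frame size and Hann window are illustrative choices, not the paper's exact settings.

```python
import numpy as np

sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)             # 1 s synthetic A4 tone

def frame_features(x, frame=1024):
    """Per-frame zero-crossing rate and spectral centroid, two of the
    timbral descriptors listed above (MFCCs etc. are omitted here)."""
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    win = np.hanning(frame)                      # window to limit leakage
    mag = np.abs(np.fft.rfft(frames * win, axis=1))
    freqs = np.fft.rfftfreq(frame, d=1 / sr)
    centroid = (mag * freqs).sum(axis=1) / np.maximum(mag.sum(axis=1), 1e-12)
    return zcr, centroid

zcr, centroid = frame_features(signal)
# The timbre vector gathers means and variances of such per-frame features.
timbre_stats = [zcr.mean(), zcr.var(), centroid.mean(), centroid.var()]
print([round(v, 3) for v in timbre_stats])
```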
3.1.2 Statistical category profiling with linear discriminative mixture model
For effective category identification, EMIF constructs a statistical model for each class using multiple features. To achieve this, each feature is extracted from the music objects, and an individual profiling model is built per class and per feature. In our framework, category profiling captures the statistical properties of the different features using the linear discriminative mixture model (LDMM), a novel classification scheme combining the advantages of both LDA and GMMs. The main advantage of LDA over other linear subspace methods is that it generates a discriminative feature space maximizing the ratio of between-class scatter to within-class scatter (Fisher's criterion).
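The LDA-then-GMM structure of such a profiling model can be sketched with standard scikit-learn components; this is not the authors' LDMM, and the toy classes, dimensions and component counts are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# Toy data: 3 classes of 20-D "acoustic features" with distinct means.
X = np.vstack([rng.normal(loc=m, size=(100, 20)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 100)

# Step 1: LDA maps the features into a discriminative low-dim subspace.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)

# Step 2: one GMM per class models that class's distribution in LDA space,
# yielding a per-class likelihood score for any input.
gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(Z[y == c])
        for c in (0, 1, 2)}

def classify(x):
    z = lda.transform(x.reshape(1, -1))
    scores = {c: g.score(z) for c, g in gmms.items()}  # log-likelihoods
    return max(scores, key=scores.get)

acc = np.mean([classify(x) == c for x, c in zip(X, y)])
print(f"training accuracy: {acc:.2f}")
```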
3.2 Fusion weight estimation
To obtain a comprehensive statistical model for each music category in EMIF, an effective fusion weight estimation scheme is needed to combine the likelihood scores. In this article, we introduce two approaches, similar to the ones used in .
3.2.1 Logistic regression-based scheme
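The fusion idea can be sketched as follows: per-feature likelihood scores are combined by a logistic model that learns one weight per feature type, so more reliable features receive larger weights. The scores and noise levels below are hypothetical toy values, not EMIF's actual likelihoods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 400
# Hypothetical per-feature scores for one class: three feature types (e.g.
# timbre, rhythm, pitch), each a noisy indicator of class membership, with
# the first feature being the least noisy (most reliable).
labels = rng.integers(0, 2, size=n)                 # 1 = belongs to class c
scores = np.column_stack([
    labels + rng.normal(scale=s, size=n) for s in (0.5, 0.8, 1.2)
])

# The fusion function learns one weight per feature type; the final fused
# score for the class is the logistic combination of the raw scores.
fuser = LogisticRegression().fit(scores, labels)
print("fusion weights:", fuser.coef_.round(2))
print("fused accuracy:", round(fuser.score(scores, labels), 2))
```

Note that the learned weight for the low-noise first feature dominates, which is exactly the behavior the fusion step relies on to compensate for weaker features.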
3.3 Deep music signature generation and music retrieval
- PCA is used to preprocess the raw input features from the different blocks via a linear transformation and to speed up the learning of the SDA.
- SDA is adopted to pretrain neural networks for each block with unlabeled data.
- For each block of the input music documents, the parameters of the SDA are optimized via stochastic gradient descent .
In our approach, SDA is applied as the basic component of the deep learning architecture for computing the DMS, one per music block. We initialize the deep neural network with the same strategy used for stacking RBMs in deep belief networks. Figure 7 illustrates the procedure for obtaining the multilayer DAE. First, the corrupted input is used only for the initial training of each layer; this is important for learning effective features. Once the mapping function \(\varPhi\) has been learnt, it is applied to uncorrupted inputs; then, to train the neurons of the next layer, corrupted training examples are again used as inputs.
After a set of encoders is trained and stacked into an SDA, the outputs of the top layer serve as the music content representation (the DMS) and as inputs to the different indexing structures for effective and efficient music search. In this study, we apply stochastic gradient descent to infer and optimize the parameters of the SDAs due to its good efficiency .
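A single denoising layer trained by per-sample stochastic gradient descent can be sketched as follows; the tied-weight decoder, masking noise level and learning rate are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy unlabeled data with low-rank structure so the layer has something to learn.
X = 1.0 / (1.0 + np.exp(-rng.normal(size=(256, 4)) @ rng.normal(size=(4, 32))))
n_vis, n_hid, lr, noise = 32, 16, 0.1, 0.2

def sig(a):
    return 1.0 / (1.0 + np.exp(-a))

W = rng.normal(scale=0.1, size=(n_vis, n_hid))   # tied encoder/decoder weights
b, c = np.zeros(n_hid), np.zeros(n_vis)

def dae_epoch(X):
    """One SGD pass over a single denoising autoencoder layer: corrupt the
    input, encode, decode, and backprop the squared error to the CLEAN input."""
    global W, b, c
    total = 0.0
    for x in X:
        xt = x * (rng.uniform(size=n_vis) > noise)   # masking corruption
        h = sig(xt @ W + b)                          # encode corrupted input
        r = sig(h @ W.T + c)                         # tied-weight decoder
        e = r - x                                    # reconstruct the clean x
        total += (e ** 2).sum()
        dr = e * r * (1.0 - r)                       # backprop through decoder
        dh = (dr @ W) * h * (1.0 - h)                # ...and through encoder
        W -= lr * (np.outer(xt, dh) + np.outer(dr, h))
        b -= lr * dh
        c -= lr * dr
    return total / len(X)

losses = [dae_epoch(X) for _ in range(20)]
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After such a layer converges, its hidden activations `h` (computed from uncorrupted inputs) become the training data for the next layer, and the top layer's outputs form the DMS.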
4 Experimental configuration
Before presenting experimental results, we first introduce the experimental configuration including the test music datasets, evaluation metrics, query tasks and competitors considered for performance comparison.
4.1 Music testbed
The testbed plays an important role in evaluating content-based music retrieval systems. To facilitate the evaluation, three separate music databases are used. The first, Dataset I, is used to test the performance of the different methods on genre-based retrieval. It contains 5000 music items covering ten genres, with 500 songs per genre, and is very similar to the test collection used in [40, 41]. To ensure a variety of recording qualities, the excerpts were taken from radio, compact disks and MP3-compressed audio files; the genre categories include Classical, Country, Dance, Hip-hop, Jazz, Reggae, Metal, Blues and Pop. The second dataset, Dataset II, is used to evaluate artist-based queries. It contains 7000 songs covering 50 different artists: 25 male singers (such as Van Morrison, Michael Jackson and Elton John) and 25 female singers (such as Kylie Minogue, Madonna and Jennifer Lopez), with 140 songs per singer. Dataset III contains 1000 recordings covering 10 different solo instruments such as piano, guitar and violin, with 100 items per category; it is used to test instrument-based similarity search. The music in Datasets II and III was collected from the CD collections of the authors and their friends.
4.2 Evaluation metrics and tasks
The efficacy of multimedia retrieval systems can be assessed with different performance metrics, each reflecting different characteristics of a system. In this study, we therefore evaluate the methods with several metrics: precision measured up to a certain rank (P@k) and mean average precision (MAP).
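The two metrics can be computed as follows; MAP is simply the mean of the average precision over all queries. The toy ranking is hypothetical.

```python
def precision_at_k(relevant, ranked, k):
    """P@k: fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(relevant, ranked):
    """AP: mean of P@k over the ranks k at which a relevant item appears.
    MAP averages this quantity over a set of queries."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# Toy run: relevant songs {a, b, c}; the system ranked a 1st, b 3rd, c 5th.
rel = {"a", "b", "c"}
ranked = ["a", "x", "b", "y", "c"]
print(precision_at_k(rel, ranked, 3))          # 2/3
print(average_precision(rel, ranked))          # (1/1 + 2/3 + 3/5) / 3
```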
Type I: Search music that has similar genre from database constructed using Dataset I.
Type II: Search music performed by the same artist from database constructed using Dataset II.
Type III: Search music with the same instrument from database constructed using Dataset III.
EMIF: In this study, a CBMIR system is built based on EMIF, and the Hybrid tree is selected as the multidimensional indexing structure to speed up music search.
DWCH + hybrid tree (DWCH+HT): The Daubechies wavelet coefficient histogram technique (DWCH) is used to extract wavelet-based music signatures describing music content. As in EMIF, the Hybrid tree is the indexing structure used to speed up the search process.
MARSYAS + hybrid tree (MARSYAS+HT): The MARSYAS framework is used to extract the signatures, which linearly combine three different acoustic features: timbral texture, pitch content and rhythm. As in EMIF and DWCH+HT, the Hybrid tree is the indexing structure used to speed up the search process.
5 An empirical study
This section presents an experimental study that evaluates the proposed method against its competitors. Our results demonstrate the superiority of EMIF over other state-of-the-art approaches on a range of measures, including retrieval accuracy, scalability in accommodating different data sizes and update processes, and efficiency in terms of query response time.
5.1 Effectiveness comparison
Query accuracy comparison of EMIF and other approaches for music retrieval based on Query Type I
Query accuracy comparison of EMIF and other approaches for music retrieval based on Query Type II
Query accuracy comparison of EMIF and other approaches for music retrieval based on Query Type III
The decision module with the logistic regression score fusion function plays an important role in enhancing EMIF's performance. To investigate its effects, we compare EMIF with and without the decision module over the three query types. Tables 2, 3 and 4 present the relative gains in query accuracy when the decision module is integrated for weight estimation. Integrating the decision module strongly influences the retrieval accuracy for all query cases, with a fairly high performance improvement (about 17%). The main reason behind this gain is that misclassification can be captured through the weighting of scores from different features with logistic-based learning: misclassifications by the LDMM-based category models in the first layer are further corrected by the inductive process of LR via an adaboost-like training algorithm. This implies that the final classification accuracy can be improved significantly via the performance compensation in the decision module, and the experimental results validate this finding empirically.
5.2 Efficiency comparison
For large music databases, query response time is another key indicator of system performance. Although the statistical concept model and decision module in EMIF lift accuracy significantly, they might introduce extra query cost. In this experiment, we show how they affect time efficiency. A test was run with 1000 query examples randomly selected from the music datasets. Figures 8, 9 and 10 show the response times of the three query types for the different methods with various result-set sizes. The results show that EMIF achieves a large gain in query speed over the other approaches for all result-set sizes, while MARSYAS+HT performs worst among the approaches tested. EMIF achieves the best response time across the query tasks, and the improvement over the other approaches is significant, at least 14.6%. The main reason is that EMIF's layered structure performs retrieval on the index structure of an individual class, which results in a more compact search space (a smaller indexing structure) and consequently much faster queries. Another major advantage of our scheme is its simplicity: all the components in our framework (such as LDA, GMM, single-layer neural networks and logistic regression) are standard techniques that can be implemented efficiently.
System reconstruction time (in seconds) comparison of different approaches with different numbers of classes
Query accuracy (P@10) comparison of different approaches with different numbers of classes (Query Type I, Dataset I)
5.3 Scalability comparison
Scalability is particularly important for large music databases, because such systems can potentially contain thousands of audio files and the contents of the data collections may be updated frequently. In this section, we illustrate the behavior of our scheme under different data sizes. EMIF is evaluated against the other schemes using (1) datasets containing different numbers of classes and (2) datasets containing different numbers of music objects. Due to space limitations, we only present the empirical results on Dataset I.
In the first experiment, we compare the reconstruction cost and query accuracy of EMIF and the other approaches when classes of music are gradually inserted into the system. Note that the subset of classes and the order of class insertion are chosen randomly. In Table 5, the number of classes varies from 1 to 10. The results show that EMIF consumes much less construction time than the other methods. One thing worth noting is that when the number of classes is less than 2, all the other methods complete construction faster than EMIF does. This is because, besides building the indexing tree and music signature generation scheme for the new class, EMIF's construction cost also includes the training time for the associated LR analysis; this overhead can make EMIF less efficient when the number of classes is small. From Table 5, we also find that the reconstruction time does not increase significantly as the system includes more object classes. The main reason is that, with the "classify-and-indexing" approach, only one associated indexing structure needs to be built when a new class is integrated into the database, and that index is much smaller. Likewise, Table 6 shows the query accuracy as the number of classes varies from 1 to 10. EMIF demonstrates much better stability in query accuracy, whereas the performance of all the other methods deteriorates rapidly as the number of classes grows.
Case I, static: The system is initially trained and tested with 1000 music items. We then increase the dataset to 2000 items, train the system again and evaluate it. This process is repeated until the dataset reaches 5000 items.
Case II, incremental: The system is trained and evaluated with 1000 music items at the first stage. Then, 1000 items are added to the system without rerunning the training process, and the evaluation is carried out again. The process is repeated until the dataset reaches 5000 items.
Query accuracy (P@10) comparison of different approaches with different numbers of music items (static case)
Query accuracy (P@10) comparison of different approaches with different numbers of music items (incremental case)
In this article, we introduced a novel approach called EMIF based on the "classify-and-indexing" design principle. To achieve a more scalable indexing framework, an independent LDMM-based profiling model is constructed for each music category using multiple features to generate likelihood scores. To address robustness issues, a decision module with a logistic regression-based score fusion function has been developed to further improve classification accuracy. Moreover, EMIF's layered architecture yields a more compact indexing structure for each category and consequently achieves a significant reduction in query execution time and update cost. The combination of the two schemes further enhances the scalability of the whole system while providing effective and efficient large-scale music search. To validate the approach, we carried out comprehensive experiments, and the results demonstrate the various advantages of EMIF over existing state-of-the-art indexing methods.
The current study can be extended in several interesting directions. First, at this stage, our method has only been tested on music data; it would be interesting to apply it to other application domains (e.g., image and video retrieval) and study the corresponding results. In addition, developing a cost model for the construction and maintenance of the indexing framework is another promising direction. Last but not least, we plan to design more advanced fusion schemes for combining scores.
- 3. Chen, J., Zhang, H., He, X., Nie, L., Liu, W., Chua, T.: Attentive collaborative filtering: multimedia recommendation with item- and component-level attention. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7–11, 2017, pp. 335–344 (2017)
- 4. He, X., He, Z., Du, X., Chua, T.: Adversarial personalized ranking for recommendation. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 8–12, 2018, pp. 355–364 (2018)
- 6. He, X., Gao, M., Kan, M., Wang, D.: BiRank: towards ranking on bipartite graphs, to appear in IEEE Trans. Knowl. Data Eng. (2017)
- 8. Essid, S., Richard, G.: Fusion of multimodal information in music content analysis. In: Multimodal Music Processing, pp. 37–52 (2012)
- 10. Cheng, Z., Shen, J., Hoi, S.C.H.: On effective personalized music retrieval by exploring online user behaviors. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17–21, 2016, pp. 125–134 (2016)
- 11. Cheng, Z., Shen, J., Zhu, L., Kankanhalli, M.S., Nie, L.: Exploiting music play sequence for music recommendation. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19–25, 2017, pp. 3654–3660 (2017)
- 12. Schedl, M., Yang, Y., Herrera-Boyer, P.: Introduction to intelligent music systems and applications. ACM TIST 8(2), 17:1–17:8 (2017)
- 14. Deldjoo, Y., Constantin, M.G., Ionescu, B., Schedl, M., Cremonesi, P.: MMTF-14K: a multifaceted movie trailer feature dataset for recommendation and retrieval. In: Proceedings of the 9th ACM Multimedia Systems Conference, MMSys 2018, Amsterdam, The Netherlands, June 12–15, 2018, pp. 450–455 (2018)
- 17. Yang, C.: Efficient acoustic index for music retrieval with various degrees of similarity. In: Proc. of ACM MM Conference (2002)
- 18. Ryynänen, M., Klapuri, A.: Query by humming of MIDI and audio using locality sensitive hashing. In: Proceedings of ICASSP, pp. 2249–2252 (2008)
- 25. Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proc. of the 23rd VLDB Conference (VLDB'97) (1997)
- 26. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proc. of the 24th VLDB Conference (VLDB'98) (1998)
- 27. Chakrabarti, K., Mehrotra, S.: The hybrid tree: an index structure for high dimensional feature spaces. In: Proc. of ICDE Conference (ICDE'99) (1999)
- 28. Yu, C., Ooi, B.C., Tan, K.L., Jagadish, H.V.: Indexing the distance: an efficient method to kNN processing. In: Proc. of the 27th VLDB Conference (VLDB'01) (2001)
- 29. Yu, Y., Crucianu, M., Oria, V., Damiani, E.: Combining multi-probe histogram and order-statistics based LSH for scalable audio content retrieval. In: Proc. of ACM MM Conference (2010)
- 30. Wu, G., Han, J., Guo, Y., Liu, L., Ding, G., Ni, Q., Shao, L.: Unsupervised deep video hashing via balanced code for large-scale video retrieval, to appear in IEEE Trans. on Image Processing
- 32. Wu, G., Han, J., Lin, Z., Ding, G., Zhang, B., Ni, Q.: Joint image-text hashing for fast large-scale cross-media retrieval using self-supervised deep learning, to appear in IEEE Trans. on Industrial Electronics
- 33. Wu, G., Lin, Z., Han, J., Liu, L., Ding, G., Zhang, B., Shen, J.: Unsupervised deep hashing via binary latent factor models for large-scale cross-modal retrieval. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July 13–19, 2018, pp. 2854–2860 (2018)
- 34. Berchtold, S., Böhm, C., Kriegel, H., Sander, J., Jagadish, H.: Independent quantization: an index compression technique for high-dimensional data spaces. In: Proc. of the 16th ICDE Conference (ICDE'00) (2000)
- 35. McFee, B., Lanckriet, G.R.G.: Large-scale music similarity search with spatial trees. In: ISMIR (2011)
- 36. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition (1993)
- 37. Nam, U., Berger, J.: Addressing the same-but-different, different-but-similar problem in automatic music classification. In: Proc. of ISMIR (2001)
- 38. Li, G., Khokhar, A.A.: Content-based indexing and retrieval of audio data using wavelets. In: Proceedings of IEEE International Conference on Multimedia and Expo (II), pp. 885–888 (2000)
- 41. Li, T., Ogihara, M., Li, Q.: A comparative study on content-based music genre classification. In: Proc. of ACM SIGIR Conference (SIGIR'03) (2003)
- 47. Jordan, M.I.: Why the logistic function? A tutorial discussion on probabilities and neural networks. Massachusetts Institute of Technology, Tech. Rep. 9503 (1995)
- 49. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. In: Proc. of the 13th Annual Conference on Computational Learning Theory (COLT'00) (2000)
- 50. Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: NIPS (2007)
- 51. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.