Abstract
Learning from high-dimensional data is a challenging task, as captured by the well-known phrase "curse of dimensionality." Most distance-based methods are impaired by distance concentration, which affects many widely used metrics in high-dimensional spaces. A recently proposed approach suggests that secondary distances, based on the number of shared k-nearest neighbors between points, may partly resolve the concentration issue and thereby improve overall performance. However, the curse of dimensionality also affects k-nearest neighbor inference in severely negative ways, one consequence being the phenomenon known as hubness: a small number of points appear disproportionately often in the k-nearest neighbor lists of other points. The impact of hubness on forming shared neighbor distances has not been examined before, and it is the focus of this paper. We further propose a new method for computing secondary distances that takes the underlying neighbor occurrence distribution into account. Our experiments suggest that this new approach achieves consistently superior performance on all considered high-dimensional data sets, while requiring essentially no extra computation compared to the original methods.
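The two quantities the abstract builds on can be sketched in a few lines. This is a minimal illustration, assuming plain Euclidean primary distances and a simple overlap-over-k normalization; it shows the generic shared-nearest-neighbor (SNN) secondary distance and the neighbor occurrence counts over which hubness is defined, not the hubness-aware method proposed in the paper. The function name and the 1 − |shared|/k form are choices made for this sketch.

```python
import numpy as np

def shared_neighbor_distance(X, k=5):
    """Compute SNN secondary distances and k-occurrence counts.

    Points are considered close when their k-nearest-neighbor lists
    overlap heavily; a point's occurrence count is how often it shows
    up in other points' k-NN lists (points with very high counts are
    the "hubs" discussed in the paper).
    """
    n = X.shape[0]
    # Primary metric: pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self from k-NN lists
    knn = np.argsort(d, axis=1)[:, :k]     # each row: indices of k nearest neighbors
    # k-occurrence counts: how often each point appears as a neighbor.
    counts = np.bincount(knn.ravel(), minlength=n)
    # SNN distance: 1 - (shared neighbors) / k.
    sets = [set(row) for row in knn]
    snn = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            snn[i, j] = 1.0 - len(sets[i] & sets[j]) / k
    np.fill_diagonal(snn, 0.0)
    return snn, counts
```

On two well-separated clusters, points within a cluster share most of their neighbor lists and so receive small secondary distances, while cross-cluster pairs share none and sit at the maximum distance of 1.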
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Tomašev, N., Mladenić, D. (2012). Hubness-Aware Shared Neighbor Distances for High-Dimensional k-Nearest Neighbor Classification. In: Corchado, E., Snášel, V., Abraham, A., Woźniak, M., Graña, M., Cho, S.B. (eds.) Hybrid Artificial Intelligent Systems. HAIS 2012. Lecture Notes in Computer Science, vol. 7209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28931-6_12
DOI: https://doi.org/10.1007/978-3-642-28931-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28930-9
Online ISBN: 978-3-642-28931-6
eBook Packages: Computer Science (R0)