Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles

Schubert, Erich; Zimek, Arthur; Kriegel, Hans-Peter

doi:10.1007/978-3-319-18123-3_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9050)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Popular outlier detection methods require the pairwise comparison of objects to compute the nearest neighbors. This inherently quadratic problem is not scalable to large data sets, making multidimensional outlier detection for big data still an open challenge. Existing approximate neighbor search methods are designed to preserve distances as well as possible. In this article, we present a highly scalable approach to compute the nearest neighbors of objects that instead focuses on preserving neighborhoods well using an ensemble of space-filling curves. We show that the method has near-linear complexity, can be distributed to clusters for computation, and preserves neighborhoods—but not distances—better than established methods such as locality sensitive hashing and projection indexed nearest neighbors. Furthermore, we demonstrate that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved, an effect that can be explained by relating our method to outlier detection ensembles. At the same time, the outlier detection process is accelerated by two orders of magnitude.
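
The neighborhood search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single curve type (the Z-order or Morton curve), random per-dimension shifts as the only source of diversity between ensemble members, and a fixed candidate window along each curve. The names `morton_key`, `approx_knn`, `sq_dist`, `n_curves`, `window`, and `bits` are hypothetical and chosen only for this illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def morton_key(cell, bits):
    """Interleave the bits of integer grid coordinates into one Z-order key."""
    key = 0
    for bit in range(bits - 1, -1, -1):  # most significant bit first
        for c in cell:
            key = (key << 1) | ((c >> bit) & 1)
    return key


def approx_knn(points, k, n_curves=8, window=10, bits=10, seed=0):
    """Approximate k nearest neighbors from an ensemble of shifted Z-order curves.

    Each curve sorts the points once; the `window` predecessors and successors
    of a point along the curve become its candidates. Candidates from all
    curves are merged and refined with exact distances.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    span = [(hi[d] - lo[d]) or 1.0 for d in range(dim)]
    scale = (1 << bits) - 1
    candidates = [set() for _ in range(n)]

    for _ in range(n_curves):
        # Assumption: a random shift per dimension makes each curve cut the
        # data space differently, so the curves err in different places.
        shift = [rng.random() for _ in range(dim)]
        keyed = []
        for i, p in enumerate(points):
            # Normalize to [0, 1], shift into [0, 2), halve back into [0, 1).
            cell = [
                int(((p[d] - lo[d]) / span[d] + shift[d]) / 2 * scale)
                for d in range(dim)
            ]
            keyed.append((morton_key(cell, bits), i))
        keyed.sort()  # sorting dominates the cost: O(n log n) per curve
        order = [i for _, i in keyed]
        for pos, i in enumerate(order):
            for j in order[max(0, pos - window): pos + window + 1]:
                if j != i:
                    candidates[i].add(j)

    # Refinement: exact distances are computed only to the merged candidates.
    return [
        sorted(candidates[i], key=lambda j: sq_dist(points[i], points[j]))[:k]
        for i in range(n)
    ]


if __name__ == "__main__":
    demo_rng = random.Random(42)
    data = [(demo_rng.random(), demo_rng.random()) for _ in range(1000)]
    neighbors = approx_knn(data, k=5)
    print(neighbors[0])  # indices of the approximate 5 nearest neighbors of point 0
```

In this sketch each point is compared with at most `2 * window * n_curves` candidates, so the total work is driven by the sorts rather than by all pairs. The resulting neighbor lists could then be passed to a density-based outlier score such as LOF in place of exact neighbors.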

Author information

Authors and Affiliations

Ludwig-Maximilians-Universität München, Oettingenstr. 67, 80538, München, Germany
Erich Schubert, Arthur Zimek & Hans-Peter Kriegel

Corresponding author

Correspondence to Erich Schubert.

Editor information

Editors and Affiliations

Universität München, München, Germany
Matthias Renz
University of Southern California, Los Angeles, USA
Cyrus Shahabi
University of Queensland, Brisbane, Australia
Xiaofang Zhou
Monash University, Clayton, Australia
Muhammad Aamir Cheema

Cite this paper

Schubert, E., Zimek, A., Kriegel, H.-P. (2015). Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_2

DOI: https://doi.org/10.1007/978-3-319-18123-3_2
Published: 09 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science; Computer Science (R0)