Detecting and ranking outliers in high-dimensional data

Kaur, Amardeep; Datta, Amitava

doi:10.1007/s12572-018-0240-y

Amardeep Kaur¹ &
Amitava Datta¹

Abstract

Detecting outliers in high-dimensional data is a challenging problem. In high-dimensional data, the outlying behaviour of data points can only be detected in locally relevant subsets of the data dimensions. These subsets of dimensions are called subspaces, and their number grows exponentially with increasing data dimensionality. A data point that is an outlier in one subspace can appear normal in another. To characterise an outlier, it is important to measure its outlying behaviour according to the number of subspaces in which it shows up as an outlier. These additional details can help a data analyst decide whether to remove, fix or keep an outlier unchanged in the dataset. In this paper, we propose an effective outlier detection algorithm for high-dimensional data, based on a recent density-based clustering algorithm called SUBSCALE. We also rank outliers by the strength of their outlying behaviour. Our outlier detection and ranking algorithm makes no assumptions about the underlying data distribution and can adapt to different density parameter settings. We experimented with different datasets, and the top-ranked outliers were predicted with more than 82% precision as well as recall.
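
The ranking idea in the abstract, scoring a point by the number of subspaces in which it shows up as an outlier, can be illustrated with a small Python sketch. This is not SUBSCALE and not the authors' algorithm: it is a brute-force toy that enumerates every subspace and assumes a point is an outlier in a subspace when it has fewer than `min_pts` neighbours within `eps` along every dimension of that subspace. The names `outlier_scores`, `rank_outliers`, `eps` and `min_pts`, and the sample data, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def outlier_scores(points, eps, min_pts):
    """Return, for each point, the number of subspaces in which it is sparse.

    Assumption (not from the paper): a point is an outlier in a subspace if
    fewer than min_pts other points lie within eps of it on every dimension
    of that subspace.
    """
    n = len(points)
    d = len(points[0])
    scores = [0] * n
    # Enumerate all 2^d - 1 non-empty subspaces; feasible only for tiny d.
    for k in range(1, d + 1):
        for dims in combinations(range(d), k):
            for i in range(n):
                neighbours = sum(
                    1
                    for j in range(n)
                    if j != i
                    and all(abs(points[i][a] - points[j][a]) <= eps for a in dims)
                )
                if neighbours < min_pts:
                    scores[i] += 1
    return scores


def rank_outliers(scores):
    """Return point indices ordered from strongest to weakest outlier."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical data: three clustered points, one point that deviates only
# in dimension 1, and one point that deviates in every dimension.
points = [
    (0.10, 0.10, 0.10),
    (0.12, 0.11, 0.09),
    (0.11, 0.12, 0.11),
    (0.13, 0.90, 0.10),
    (0.90, 0.50, 0.90),
]
scores = outlier_scores(points, eps=0.05, min_pts=2)
ranking = rank_outliers(scores)
print(scores)   # [0, 0, 0, 4, 7]
print(ranking)  # [4, 3, 0, 1, 2]
```

In this toy data the fourth point looks normal in any subspace that excludes dimension 1, so it is an outlier in only 4 of the 7 subspaces, whereas the last point is an outlier in all 7 and is ranked first. The exhaustive enumeration is exponential in the number of dimensions, which is the cost the abstract alludes to.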

Author information

Authors and Affiliations

School of Computer Science and Software Engineering, University of Western Australia, Perth, Australia
Amardeep Kaur & Amitava Datta

Corresponding author

Correspondence to Amitava Datta.

About this article

Cite this article

Kaur, A., Datta, A. Detecting and ranking outliers in high-dimensional data. Int J Adv Eng Sci Appl Math 11, 75–87 (2019). https://doi.org/10.1007/s12572-018-0240-y

Published: 14 December 2018
Issue Date: 01 March 2019
DOI: https://doi.org/10.1007/s12572-018-0240-y
