Outlier Detection in Categorical Data

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 155)

Abstract

This chapter examines a specific research issue in the outlier detection problem, namely the type of data attributes. More specifically, it presents the case of analyzing data described using categorical attributes (features). The performance of a detection algorithm is known to depend directly on the way outliers are perceived. Typically, categorical data are processed by considering the occurrence frequencies of the various attribute values. Accordingly, the objective here is to characterize the deviating nature of data objects with respect to individual attributes as well as in the joint distribution of two or more attributes. This can be achieved by defining the measure of deviation in terms of the attribute value frequencies. In addition, cluster analysis provides valuable insight into the inherent grouping structure of the data, which helps in identifying the deviating objects. Based on this understanding, this chapter presents algorithms developed for the detection of outliers in categorical data.
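To make the frequency-based idea concrete, the following is a minimal sketch of one possible deviation measure, not the algorithms presented in the chapter. It scores each object by the mean relative frequency of its attribute values, taken either one attribute at a time or over joint combinations of attributes, so that a lower score suggests a more deviating object. The function name `frequency_scores`, the `joint_size` parameter, and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequency_scores(rows, joint_size=1):
    """Mean relative frequency of each row's attribute-value combinations.

    joint_size=1 looks at individual attributes; joint_size=2 looks at the
    joint distribution of every pair of attributes, and so on.
    Hypothetical helper for illustration, not taken from the chapter.
    """
    n = len(rows)
    n_attrs = len(rows[0])
    subsets = list(combinations(range(n_attrs), joint_size))
    # Count how often each value combination occurs, per attribute subset.
    counts = {
        s: Counter(tuple(row[i] for i in s) for row in rows) for s in subsets
    }
    scores = []
    for row in rows:
        total = sum(counts[s][tuple(row[i] for i in s)] for s in subsets)
        scores.append(total / (n * len(subsets)))
    return scores


# Toy categorical data set (made up): colour, size, shape.
rows = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]

single = frequency_scores(rows, joint_size=1)  # individual attributes
pairs = frequency_scores(rows, joint_size=2)   # pairs of attributes

# The object with the lowest score is the strongest outlier candidate.
most_deviating = min(range(len(rows)), key=single.__getitem__)
print(most_deviating, single, pairs)
```

In this toy data the last object carries rare values in every attribute and therefore receives the lowest score under both settings, while the third object looks ordinary attribute by attribute but scores noticeably lower once pairs of attributes are considered.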

Author information

Corresponding author

Correspondence to N. N. R. Ranga Suri.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Ranga Suri, N.N.R., Murty M, N., Athithan, G. (2019). Outlier Detection in Categorical Data. In: Outlier Detection: Techniques and Applications. Intelligent Systems Reference Library, vol 155. Springer, Cham. https://doi.org/10.1007/978-3-030-05127-3_5
