Outlier Detection in Categorical Data

  • N. N. R. Ranga Suri
  • Narasimha Murty M
  • G. Athithan
Chapter
Part of the Intelligent Systems Reference Library book series (ISRL, volume 155)

Abstract

This chapter examines a specific research issue in outlier detection: the type of data attributes. In particular, it addresses the analysis of data described by categorical attributes (features). The performance of a detection algorithm depends directly on how outliers are perceived. Categorical data are typically processed by considering the occurrence frequencies of the various attribute values. Accordingly, the objective is to characterize how data objects deviate with respect to individual attributes as well as in the joint distribution of two or more attributes, which can be achieved by defining the measure of deviation in terms of attribute value frequencies. In addition, cluster analysis provides insight into the inherent grouping structure of the data, which helps in identifying the deviating objects. Based on this understanding, the chapter presents algorithms developed for detecting outliers in categorical data.
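The frequency-based notion of deviation described above can be illustrated with a minimal sketch. It is not the chapter's algorithm: the function name `frequency_scores`, the `joint_size` parameter, and the toy records are assumptions made for illustration. Each record is scored by the mean relative frequency of its attribute values (or of its value combinations when `joint_size` is greater than 1), so a lower score suggests a more deviating object.

```python
from collections import Counter
from itertools import combinations


def frequency_scores(records, joint_size=1):
    """Score each record by the mean relative frequency of its values.

    joint_size=1 uses individual attributes; joint_size=2 uses every pair
    of attributes (their joint distribution), and so on.
    Lower scores indicate rarer, more deviating records.
    """
    n = len(records)
    num_attrs = len(records[0])
    # All attribute subsets of the requested size, e.g. (0,), (1,) or (0, 1).
    subsets = list(combinations(range(num_attrs), joint_size))
    # Occurrence counts of each value combination, per attribute subset.
    counts = {
        s: Counter(tuple(r[i] for i in s) for r in records) for s in subsets
    }
    scores = []
    for r in records:
        total = sum(counts[s][tuple(r[i] for i in s)] for s in subsets)
        scores.append(total / (n * len(subsets)))
    return scores


# Hypothetical toy data: (colour, size). The last record combines two
# individually common values in a rare way.
records = [
    ("red", "small"), ("red", "small"), ("red", "small"),
    ("blue", "large"), ("blue", "large"), ("blue", "large"),
    ("red", "large"),
]

single = frequency_scores(records, joint_size=1)
joint = frequency_scores(records, joint_size=2)
for rec, s1, s2 in zip(records, single, joint):
    print(rec, round(s1, 3), round(s2, 3))
```

In this toy data the record ("red", "large") looks ordinary when attributes are considered one at a time, because both of its values are frequent, yet it receives the lowest score once the pair of attributes is considered jointly. This is the reason the abstract mentions both individual attributes and joint distributions.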

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • N. N. R. Ranga Suri (1), corresponding author
  • Narasimha Murty M (2)
  • G. Athithan (3)
  1. Centre for Artificial Intelligence and Robotics (CAIR), Bangalore, India
  2. Department of Computer Science and Automation, Indian Institute of Science (IISc), Bangalore, India
  3. Defence Research and Development Organization (DRDO), New Delhi, India
