Outlier Detection in Categorical Data

Ranga Suri, N. N. R.; Murty M, Narasimha; Athithan, G.

doi:10.1007/978-3-030-05127-3_5

Chapter
First Online: 11 January 2019

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 155)

Abstract

This chapter examines a specific research issue in outlier detection: the type of data attributes. In particular, it addresses the analysis of data described by categorical attributes (features). The performance of a detection algorithm is known to depend directly on how outliers are perceived. Categorical data are typically processed by considering the occurrence frequencies of the various attribute values. Accordingly, the objective here is to characterize how data objects deviate, both with respect to individual attributes and in the joint distribution of two or more attributes. This can be achieved by defining the measure of deviation in terms of attribute value frequencies. In addition, cluster analysis provides valuable insight into the inherent grouping structure of the data, which helps in identifying deviating objects. Based on this understanding, the chapter presents algorithms developed for detecting outliers in categorical data.
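
The idea of measuring deviation through attribute value frequencies can be illustrated with a minimal sketch. It is not the chapter's algorithm: the function name `frequency_scores`, the `use_pairs` flag, the averaging rule, and the toy records are all assumptions made for illustration. Each record is scored by the mean relative frequency of its attribute values, optionally including the joint frequencies of attribute pairs, so that lower scores suggest more deviating objects.

```python
from collections import Counter
from itertools import combinations


def frequency_scores(records, use_pairs=False):
    """Score each record by the mean relative frequency of its values.

    Illustrative assumption, not the chapter's method: lower scores mean
    the record is built from rarer values and is a stronger outlier candidate.
    """
    n = len(records)
    n_attrs = len(records[0])
    # Marginal counts: how often each value occurs within each attribute.
    single = [Counter(r[j] for r in records) for j in range(n_attrs)]
    # Optional joint counts over every pair of attributes.
    pairs = {}
    if use_pairs:
        for a, b in combinations(range(n_attrs), 2):
            pairs[(a, b)] = Counter((r[a], r[b]) for r in records)
    scores = []
    for r in records:
        terms = [single[j][r[j]] / n for j in range(n_attrs)]
        terms += [cnt[(r[a], r[b])] / n for (a, b), cnt in pairs.items()]
        scores.append(sum(terms) / len(terms))
    return scores


# Hypothetical toy data: (colour, size) per object.
records = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("green", "huge"),
]
scores = frequency_scores(records)
pair_scores = frequency_scores(records, use_pairs=True)
# Indices ordered from most to least deviating under this score.
ranked = sorted(range(len(records)), key=lambda i: scores[i])
```

In this toy data the last record combines two values that each occur only once, so it receives the lowest score and is ranked first. A score of this kind looks only at frequencies; the cluster-based view mentioned in the abstract would need additional machinery.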

Author information

Authors and Affiliations

Centre for Artificial Intelligence and Robotics (CAIR), Bangalore, India
N. N. R. Ranga Suri
Department of Computer Science and Automation, Indian Institute of Science (IISc), Bangalore, India
Narasimha Murty M
Defence Research and Development Organization (DRDO), New Delhi, India
G. Athithan

Corresponding author

Correspondence to N. N. R. Ranga Suri.

About this chapter

Cite this chapter

Ranga Suri, N.N.R., Murty M, N., Athithan, G. (2019). Outlier Detection in Categorical Data. In: Outlier Detection: Techniques and Applications. Intelligent Systems Reference Library, vol 155. Springer, Cham. https://doi.org/10.1007/978-3-030-05127-3_5

DOI: https://doi.org/10.1007/978-3-030-05127-3_5
Published: 11 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05125-9
Online ISBN: 978-3-030-05127-3
eBook Packages: Intelligent Technologies and Robotics (R0)
