Abstract
Clustering, the grouping of similar data points, is one of the central problems in data mining. For categorical data, data labeling is recognized as an important technique: points that were not labeled earlier are assigned to existing clusters through the labeling process. Many clustering approaches exist for numerical data, but very few algorithms apply to categorical data, where the most challenging problem is allocating every unlabeled data point to the proper cluster. This paper proposes a method for labeling similar data points and maintaining them in the proper clusters. The method is applied to the US Census data set, which has 68 categorical attributes and was derived from the raw data collected for the 1990 US census. The proposal allocates each unlabeled data point to its corresponding cluster as part of data labeling, which is useful for understanding demographic surveys of the public. The method has two advantages: (1) the proposed algorithm exhibits high execution efficiency, and (2) it produces clusters of high quality. The proposed algorithm is empirically validated on the US Census data set and is shown to be considerably more efficient than previous works while attaining high-quality results.
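To make the data labeling step concrete, the following is a minimal sketch of assigning an unlabeled categorical record to the most similar existing cluster. It is not the paper's method: the abstract does not give the rough-entropy similarity measure, so a simple attribute-value frequency score stands in for it. The names `cluster_profile`, `similarity`, and `label_point`, and the toy records, are hypothetical.

```python
from collections import Counter


def cluster_profile(records):
    """Summarize one cluster as per-attribute value counts plus its size."""
    n_attrs = len(records[0])
    counts = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return counts, len(records)


def similarity(point, profile):
    """Stand-in similarity (an assumption, not rough entropy):
    the mean relative frequency of the point's values within the cluster."""
    counts, size = profile
    return sum(counts[j][v] / size for j, v in enumerate(point)) / len(point)


def label_point(point, profiles):
    """Return the name of the cluster with the highest similarity to the point."""
    return max(profiles, key=lambda name: similarity(point, profiles[name]))


# Toy clustered records with three categorical attributes each (illustrative only).
clusters = {
    "c1": [("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p")],
    "c2": [("b", "z", "q"), ("b", "z", "q"), ("c", "z", "p")],
}
profiles = {name: cluster_profile(recs) for name, recs in clusters.items()}

# Label a previously unlabeled record.
label = label_point(("a", "x", "q"), profiles)
print(label)
```

Because each cluster is reduced to a frequency summary, labeling one record costs time proportional to the number of clusters times the number of attributes, independent of cluster size. Swapping in a different similarity measure only requires replacing `similarity`.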
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sreenivasulu, G., Viswanadha Raju, S., Sambasiva Rao, N. (2019). An Efficient Approach for Clustering US Census Data Based on Cluster Similarity Using Rough Entropy on Categorical Data. In: Fong, S., Akashe, S., Mahalle, P. (eds) Information and Communication Technology for Competitive Strategies. Lecture Notes in Networks and Systems, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-0586-3_37
DOI: https://doi.org/10.1007/978-981-13-0586-3_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0585-6
Online ISBN: 978-981-13-0586-3
eBook Packages: Engineering, Engineering (R0)