Abstract
Clustering and classification are significant and widely used tasks in data mining, but they are rarely combined. When integrated, they can give more promising, accurate, and robust results than either used alone. The integration can be done by an ensemble method or a hybrid method. This paper uses a hybrid model in which K-means clustering preprocesses the data. Pre-learning by K-means clustering keeps similar cases in the same group, which improves the performance of the classifier at hand. To demonstrate the applicability of this hybrid approach, experiments were conducted on the PIMA diabetic datasets from the UCI repository, and the results are compared on several parameters. Clustering before classification adds a description to the data and improves the effectiveness of the classification task. The model can be deployed with any classification algorithm to improve its performance.
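The hybrid idea can be illustrated with a minimal sketch. The abstract does not spell out how the cluster information is passed to the classifier, so this sketch assumes one plausible reading of "added description": the index of the nearest K-means centroid is appended to each record as an extra feature before classification. The toy data, the helper names (`kmeans`, `augment`, `predict_1nn`), and the 1-nearest-neighbour classifier are all hypothetical stand-ins, not the paper's data or method.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def nearest_cluster(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))

def augment(points, centroids):
    """Assumed hybrid step: append the nearest-cluster index as an extra feature."""
    return [list(p) + [float(nearest_cluster(p, centroids))] for p in points]

def predict_1nn(train_X, train_y, x):
    """Stand-in classifier: label of the single nearest training record."""
    i = min(range(len(train_X)), key=lambda j: math.dist(train_X[j], x))
    return train_y[i]

# Hypothetical toy records (two features, binary label); not the PIMA data.
X = [[1.0, 1.0], [8.0, 8.0], [1.2, 0.8], [0.9, 1.1], [8.2, 7.9], [7.8, 8.1]]
y = [0, 1, 0, 0, 1, 1]

centroids, _ = kmeans(X, k=2)            # unsupervised pre-learning
X_aug = augment(X, centroids)            # records now carry a cluster feature
query = augment([[7.5, 8.3]], centroids)[0]
pred = predict_1nn(X_aug, y, query)      # classify on the augmented features
print(pred)
```

Because the classifier only sees the augmented records, any other classification algorithm could replace `predict_1nn` without changing the clustering step. In practice the features would usually be scaled first, since an unscaled cluster index can dominate or vanish in a distance computation.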
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Gupta, S., Parekh, B., Jivani, A. (2019). A Hybrid Model of Clustering and Classification to Enhance the Performance of a Classifier. In: Luhach, A., Jat, D., Hawari, K., Gao, XZ., Lingras, P. (eds) Advanced Informatics for Computing Research. ICAICR 2019. Communications in Computer and Information Science, vol 1076. Springer, Singapore. https://doi.org/10.1007/978-981-15-0111-1_34
DOI: https://doi.org/10.1007/978-981-15-0111-1_34
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0110-4
Online ISBN: 978-981-15-0111-1
eBook Packages: Computer Science, Computer Science (R0)