A Hybrid Model of Clustering and Classification to Enhance the Performance of a Classifier

Gupta, Subodhini; Parekh, Bhushan; Jivani, Anjali

doi:10.1007/978-981-15-0111-1_34

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1076)

Included in the following conference series:

International Conference on Advanced Informatics for Computing Research

Abstract

Clustering and classification are significant and widely used tasks in data mining, but they are rarely combined. When integrated, they can give more promising, accurate, and robust results than either technique used alone. The two can be integrated through an ensemble method or a hybrid method. This paper uses a hybrid model in which K-means clustering preprocesses the data. Pre-learning by K-means clustering keeps similar cases in the same group, which improves the performance of the classifier at hand. To demonstrate the applicability of this hybrid approach, experiments were conducted on the PIMA diabetic datasets from the UCI repository, and the results are compared on several parameters. Clustering before classification adds descriptive information to the data and improves the effectiveness of the classification task. The model can be deployed with any classification algorithm to improve its performance.
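
The abstract does not say how the K-means output is passed to the classifier, so the following is only a minimal sketch under one assumption: the cluster index is appended to each record as an extra feature (one reading of the "added description" above). The helper names (`kmeans`, `add_cluster_feature`, `knn_predict`), the toy data, and the choice of a nearest-neighbor classifier are all hypothetical and are not taken from the paper. The sketch uses only the Python standard library; in practice a library implementation of K-means and of the classifier would normally be used.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means on the feature vectors only; returns the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # New centroid = mean of its group; an empty group keeps its old centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def add_cluster_feature(points, centroids):
    """Assumed hybrid step: append the cluster index as one extra feature."""
    return [tuple(p) + (float(nearest(p, centroids)),) for p in points]

def knn_predict(train_X, train_y, x, k=3):
    """Stand-in classifier: majority vote among the k nearest training rows."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def make_toy_data(n=40, seed=1):
    """Two synthetic, well-separated groups; NOT the PIMA data."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, (0.0, 0.0)), (1, (5.0, 5.0))):
        for _ in range(n):
            X.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
            y.append(label)
    return X, y

X, y = make_toy_data()
centroids = kmeans(X, k=2)                    # clustering ignores the class labels
X_aug = add_cluster_feature(X, centroids)     # preprocessing output
query = (4.8, 5.1)
query_aug = add_cluster_feature([query], centroids)[0]
prediction = knn_predict(X_aug, y, query_aug)
print("predicted class:", prediction)
```

Because the clustering step only changes the feature vectors, any classifier could take the place of `knn_predict`, which matches the claim that the model is classifier-agnostic. A numeric cluster index is a crude encoding when there are more than two clusters; a one-hot encoding may be preferable. A different hybrid design, also consistent with the abstract, would instead use the clusters to filter the training records before classification.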

Author information

Authors and Affiliations

Computer Science Department, NRI Group of Institutes, Bhopal, M.P., India
Subodhini Gupta
Department of Computer Science and Engineering, Maharaja Sayajirao University of Baroda, Vadodara, Gujarat, India
Bhushan Parekh & Anjali Jivani

Corresponding author

Correspondence to Subodhini Gupta.

Editor information

Editors and Affiliations

Papua New Guinea University of Technology, Lae, Papua New Guinea
Ashish Kumar Luhach
Computer Science Department, Namibia University of Science and Technology, Windhoek, Namibia
Dharm Singh Jat
Universiti Malaysia Pahang, Pekan, Pahang, Malaysia
Kamarul Bin Ghazali Hawari
School of Computing, University of Eastern Finland, Kuopio, Finland
Xiao-Zhi Gao
Department of Mathematics and Computing Science, Saint Mary's University, Halifax, Canada
Pawan Lingras

Cite this paper

Gupta, S., Parekh, B., Jivani, A. (2019). A Hybrid Model of Clustering and Classification to Enhance the Performance of a Classifier. In: Luhach, A., Jat, D., Hawari, K., Gao, XZ., Lingras, P. (eds) Advanced Informatics for Computing Research. ICAICR 2019. Communications in Computer and Information Science, vol 1076. Springer, Singapore. https://doi.org/10.1007/978-981-15-0111-1_34

DOI: https://doi.org/10.1007/978-981-15-0111-1_34
Published: 17 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0110-4
Online ISBN: 978-981-15-0111-1
eBook Packages: Computer Science, Computer Science (R0)