Abstract
In this work, we developed and experimentally validated a novel model for external clustering validation on very large data sets using the Conditional Entropy index. The model, termed MR-Centropy, performs clustering validation in a parallel and distributed manner using the MapReduce framework. The aim is to scale with increasing dataset sizes when a ground-truth clustering is available. MR-Centropy is a three-job process in which each job consists of a Map function and a Reduce function. Three jobs are needed to gather all the statistics involved in computing the Conditional Entropy index. Each step of the proposed framework runs in parallel. Numerical tests on real and synthetic datasets demonstrate the effectiveness of the proposed model.
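The abstract does not spell out what each of the three jobs computes, so the following is only a minimal sketch of one plausible decomposition, not the authors' implementation. It assumes the index is the conditional entropy of the ground-truth partition given the clustering, H(T|C) = -Σ_ij (n_ij / n) log2(n_ij / n_i), where n_ij counts points in cluster i with class j and n_i is the size of cluster i. The helper `run_job` is a hypothetical in-memory stand-in for a MapReduce job (map, group by key, reduce); it is not a Hadoop API.

```python
import math
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


def conditional_entropy(pairs):
    """Compute H(T|C) from (cluster_label, truth_label) pairs, one per point.

    The split into three jobs below is an assumption made for illustration.
    """
    pairs = list(pairs)
    n = len(pairs)

    # Job 1 (assumed): count points in each (cluster, class) cell -> n_ij.
    cells = run_job(
        pairs,
        lambda p: [((p[0], p[1]), 1)],
        lambda key, values: (key, sum(values)),
    )

    # Job 2 (assumed): sum the cell counts per cluster -> n_i.
    sizes = dict(
        run_job(
            cells,
            lambda cell: [(cell[0][0], cell[1])],
            lambda key, values: (key, sum(values)),
        )
    )

    # Job 3 (assumed): emit one entropy term per cell and sum them.
    def term_mapper(cell):
        (cluster, _truth), n_ij = cell
        yield "H", -(n_ij / n) * math.log2(n_ij / sizes[cluster])

    result = run_job(cells, term_mapper, lambda key, values: sum(values))
    return result[0] if result else 0.0
```

A clustering that matches the ground truth exactly gives a value of 0, and the value grows as clusters mix more classes. In a real deployment each mapper and reducer would run on separate data splits, and the cluster sizes from the second job would have to be made available to the third, for example as side data.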
Acknowledgment
This work has been supported by the National Research Project CNEPRU under grant N: B*07120140037.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Zerabi, S., Meshoul, S., Khantoul, B. (2019). Parallel Clustering Validation Based on MapReduce. In: Demigha, O., Djamaa, B., Amamra, A. (eds) Advances in Computing Systems and Applications. CSA 2018. Lecture Notes in Networks and Systems, vol 50. Springer, Cham. https://doi.org/10.1007/978-3-319-98352-3_31
DOI: https://doi.org/10.1007/978-3-319-98352-3_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98351-6
Online ISBN: 978-3-319-98352-3
eBook Packages: Intelligent Technologies and Robotics (R0)