Cluster Representation and Discrimination Based on Regression Line

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 801)

Abstract

Clustering aims to group data into coherent clusters based on the proximity of samples in a multi-dimensional feature space, where the cohesion within each cluster distinguishes it from the others. A cluster representation built on this intrinsic cohesiveness supports efficient discrimination between clusters. In this paper, a novel method for representing and discriminating between clusters is proposed. Distances computed between every pair of samples in a cluster reveal the cohesiveness of the samples in multi-dimensional space. Because the number of pairwise distances grows quadratically with the number of samples, the distances are summarized by histograms. The range of the bins in a histogram specifies the spread of distances among the samples in a cluster. For effective discrimination, each histogram is further transformed into a regression line by constructing its cumulative histogram, and each cluster is represented by the slope, intercept and error characterizing that line. The extent and angle of the slope are determined by the diameter of the cluster spanned by the bins and by the distribution of distances across the bins. To discriminate clusters represented by regression lines, a p-value hypothesis test is performed, and based on the resulting probability the clusters are judged to be similar or dissimilar. Experiments on real and synthetic clusters demonstrate the efficiency of the proposed approach in extracting a unique cluster representation for discrimination.
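The pipeline described above can be sketched in Python as a minimal illustration, not the authors' implementation: Euclidean pairwise distances, a fixed bin count of 10, a two-tailed t-test on the difference of regression slopes, the degrees of freedom, and the helper names cluster_signature and compare_clusters are all assumptions made here for demonstration; the paper's exact binning and test statistic may differ.

```python
# Sketch of the regression-line cluster representation outlined in the abstract.
# Assumptions (not taken from the paper): Euclidean distances, 10 histogram bins,
# and a two-tailed t-test on the difference of regression slopes.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import linregress, t


def cluster_signature(samples, n_bins=10):
    """Represent a cluster by the regression line fitted to the
    cumulative histogram of its pairwise distances."""
    distances = pdist(samples)                        # all pairwise distances
    counts, edges = np.histogram(distances, bins=n_bins)
    cumulative = np.cumsum(counts)                    # cumulative histogram
    bin_centres = (edges[:-1] + edges[1:]) / 2.0
    return linregress(bin_centres, cumulative)        # slope, intercept, stderr


def compare_clusters(fit_a, fit_b, n_bins=10, alpha=0.05):
    """Hypothetical p-value test: two-tailed t-test on the slope difference,
    using the standard errors reported by linregress."""
    se_diff = np.hypot(fit_a.stderr, fit_b.stderr)
    t_stat = (fit_a.slope - fit_b.slope) / se_diff
    dof = 2 * (n_bins - 2)                            # df pooled from both fits
    p_value = 2.0 * t.sf(abs(t_stat), dof)
    return p_value, p_value >= alpha                  # True -> clusters similar


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=(100, 4))           # synthetic cluster A
    b = rng.normal(0.0, 1.0, size=(120, 4))           # same spread as A
    c = rng.normal(0.0, 3.0, size=(110, 4))           # much wider spread
    p_ab, _ = compare_clusters(cluster_signature(a), cluster_signature(b))
    p_ac, _ = compare_clusters(cluster_signature(a), cluster_signature(c))
    print(f"p(A,B) = {p_ab:.3f}, p(A,C) = {p_ac:.3f}")
```

In this toy run, clusters A and B share the same spread and should yield a large p-value (similar), while the wider cluster C should produce a small p-value and be flagged as dissimilar.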

Keywords

Cluster cohesion · Representation · Discrimination · Histogram · Regression


Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. BIT, Bangalore, India
  2. BNMIT, Bangalore, India
