Interpretability-based validity methods for clustering results evaluation

Journal of Intelligent Information Systems

Abstract

Validation and interpretation are the last two steps of a clustering process. These steps are generally carried out separately, because existing validity measures are not designed to express whether clusters are interpretable. In this paper we propose merging the validation and interpretation steps by using a new supervised measure, which we call the Homogeneity degree, that validates clusters on the criterion of interpretability. We also present an extended version of this measure to improve its use as a relative measure.

Notes

  1. \(H(P)=-\sum_{i=1}^{k}\frac{\| C_i \|}{n}\log\frac{\| C_i \|}{n}\), \(H(L)=-\sum_{j=1}^{r}\frac{\| l_j \|}{n}\log\frac{\| l_j \|}{n}\)

  2. The overall Purity of the partition in such a situation is equal to 1.

  3. Section 7.4 presents a discussion of α and β.

  4. The two values of DP α, β are close to each other according to the two closeness techniques.

  5. The complexity of the algorithm depends on the complexity of the CLUSTERING method.

  6. See http://archive.ics.uci.edu/ml/datasets.html

  7. The use of a graph is suggested only if the measure exhibits an increasing or decreasing trend as the number of clusters increases. In this case, we select the value of k that generates a significant local change with the shape of a “knee”.
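
The entropies in note 1 and the overall Purity mentioned in note 2 can be illustrated with a minimal Python sketch. It rests on assumptions not stated on this page: the logarithm is taken as natural, Purity follows the common textbook definition (the fraction of objects carrying the majority label of their cluster), and the function names and toy data are hypothetical, not the authors' implementation.

```python
from collections import Counter
from math import log


def entropy(assignments):
    """Entropy -sum_i (|C_i|/n) log(|C_i|/n) of a labelling (natural log assumed)."""
    n = len(assignments)
    counts = Counter(assignments)
    return -sum((c / n) * log(c / n) for c in counts.values())


def purity(clusters, labels):
    """Overall purity, textbook definition (assumed): share of objects that
    carry the majority label of the cluster they were assigned to."""
    per_cluster = {}
    for cluster_id, label in zip(clusters, labels):
        per_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(cnt.values()) for cnt in per_cluster.values())
    return majority_total / len(clusters)


# Hypothetical toy data: three clusters of two objects each, two labels.
clusters = [0, 0, 1, 1, 2, 2]
labels = ["a", "a", "b", "b", "b", "b"]

h_p = entropy(clusters)         # H(P), entropy of the partition
h_l = entropy(labels)           # H(L), entropy of the label set
pur = purity(clusters, labels)  # every cluster holds a single label

print(f"H(P)={h_p:.4f}  H(L)={h_l:.4f}  Purity={pur:.1f}")
```

In this toy example every cluster contains objects of a single label, so the overall Purity is 1 even though label "b" is spread over two clusters.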

Author information

Corresponding author

Correspondence to Yosr Naïja.

About this article

Cite this article

Naïja, Y., Sinaoui, K.B. Interpretability-based validity methods for clustering results evaluation. J Intell Inf Syst 39, 109–139 (2012). https://doi.org/10.1007/s10844-011-0185-0

  • DOI: https://doi.org/10.1007/s10844-011-0185-0
