Ensembles of Cluster Validation Indices for Label Noise Filtering

  • Jan Kohstall
  • Veselka Boeva
  • Lars Lundberg
  • Milena Angelova
Part of the Studies in Computational Intelligence book series (SCI, volume 864)


Cluster validation measures are designed to find the partitioning that best fits the underlying data. In this study, we show that these measures can also be used to identify mislabeled instances or class outliers prior to training in supervised learning problems. We introduce an ensemble technique, entitled CVI-based Outlier Filtering, which identifies and eliminates mislabeled instances from the training set and then builds a classification hypothesis from the remaining instances. Our approach assigns to each instance in the training set several cluster validation scores, each representing its potential of being a class outlier with respect to the clustering property that the corresponding validation measure assesses. In this respect, the proposed approach may be referred to as a multi-criteria outlier filtering measure. In this work, we specifically study and evaluate value-based ensembles of cluster validation indices. The added value of this approach in comparison to the logical and rank-based ensemble solutions is discussed and further demonstrated.
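The filtering idea summarized above can be illustrated with a minimal sketch. It is not the chapter's implementation: the two per-instance scores (a silhouette computed with class labels used as cluster assignments, and a k-nearest-neighbour label agreement), the min-max normalization, the averaging rule, the threshold of 0.5, and all function names are assumptions chosen only to show what a value-based ensemble of per-instance validation scores could look like.

```python
from math import dist
from statistics import mean


def silhouette_scores(points, labels):
    """Per-instance silhouette, treating each class label as a cluster."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        by_class = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if j != i:
                by_class.setdefault(other, []).append(dist(p, q))
        if lab not in by_class or len(by_class) < 2:
            scores.append(0.0)  # singleton class or single-class data
            continue
        a = mean(by_class[lab])  # mean distance to the instance's own class
        b = min(mean(d) for c, d in by_class.items() if c != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def neighbour_agreement(points, labels, k=3):
    """Fraction of the k nearest neighbours sharing the instance's label."""
    scores = []
    for i, p in enumerate(points):
        others = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = others[:k]
        same = sum(labels[j] == labels[i] for _, j in nearest)
        scores.append(same / len(nearest))
    return scores


def min_max(values):
    """Rescale scores to [0, 1] so that different indices are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def value_based_filter(points, labels, threshold=0.5, k=3):
    """Average the normalized scores and keep instances at or above threshold.

    Assumption: a plain mean stands in for the value-based aggregation.
    """
    per_index = [
        min_max(silhouette_scores(points, labels)),
        min_max(neighbour_agreement(points, labels, k)),
    ]
    combined = [mean(column) for column in zip(*per_index)]
    keep = [i for i, score in enumerate(combined) if score >= threshold]
    return keep, combined


# Toy data: two well-separated groups; index 3 lies in the first group
# but carries the label of the second one.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.05)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

keep, combined = value_based_filter(points, labels)
print(keep)
```

On this toy data the instance at index 3 receives the lowest value under both scores, so its combined score is the smallest and it is removed before a classifier would be trained on the kept instances. A logical ensemble would instead vote on per-index outlier decisions, and a rank-based one would aggregate orderings rather than the normalized values themselves.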



This work is part of the research project “Scalable resource efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Jan Kohstall (1)
  • Veselka Boeva (2)
  • Lars Lundberg (2)
  • Milena Angelova (3)

  1. acs Plus GmbH, Berlin, Germany
  2. Blekinge Institute of Technology, Karlskrona, Sweden
  3. Technical University of Sofia, Plovdiv, Bulgaria
