Ensembles of Cluster Validation Indices for Label Noise Filtering

Intelligent Systems: Theory, Research and Innovation in Applications

Part of the book series: Studies in Computational Intelligence (SCI, volume 864)

Abstract

Cluster validation measures are designed to find the partitioning that best fits the underlying data. In this study, we show that these measures can also be used to identify mislabeled instances, or class outliers, prior to training in supervised learning problems. We introduce an ensemble technique, entitled CVI-based Outlier Filtering, which identifies and eliminates mislabeled instances from the training set and then builds a classification hypothesis from the remaining instances. Our approach assigns each training instance several cluster validation scores, each representing its potential to be a class outlier with respect to the clustering property that the corresponding validation measure assesses. In this respect, the proposed approach may be regarded as a multi-criteria outlier filtering measure. In this work, we specifically study and evaluate value-based ensembles of cluster validation indices. The added value of this approach in comparison to the logical and rank-based ensemble solutions is discussed and further demonstrated.
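The idea of scoring each training instance with several validation measures and combining the values can be illustrated with a minimal sketch. It is not the chapter's implementation: the two per-instance scores (a silhouette computed with the class labels as the partition, and a k-nearest-neighbour label agreement), the plain average used as the value-based combination, the threshold of 0, and all function names are assumptions chosen for illustration only.

```python
import math


def silhouette_scores(X, y):
    """Per-instance silhouette, using the class labels as the partition."""
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        dists = {}  # label -> distances from xi to the other instances with that label
        for j, (xj, yj) in enumerate(zip(X, y)):
            if i != j:
                dists.setdefault(yj, []).append(math.dist(xi, xj))
        if yi not in dists or len(dists) < 2:
            scores.append(0.0)  # singleton class or single class: use a neutral score
            continue
        a = sum(dists[yi]) / len(dists[yi])  # mean distance to own class
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != yi)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def neighbour_agreement(X, y, k=3):
    """Share of the k nearest neighbours with the same label, rescaled to [-1, 1]."""
    scores = []
    for i, xi in enumerate(X):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(xi, X[j]))
        nearest = others[:k]
        if not nearest:
            scores.append(0.0)  # no neighbours available
            continue
        same = sum(1 for j in nearest if y[j] == y[i])
        scores.append(2.0 * same / len(nearest) - 1.0)
    return scores


def filter_label_noise(X, y, threshold=0.0, k=3):
    """Average the two scores per instance (an assumed value-based combination)
    and keep the indices whose combined score reaches the threshold."""
    sil = silhouette_scores(X, y)
    agr = neighbour_agreement(X, y, k)
    combined = [(s + a) / 2.0 for s, a in zip(sil, agr)]
    keep = [i for i, c in enumerate(combined) if c >= threshold]
    return keep, combined


# Hypothetical toy data: two compact classes plus one instance that lies
# inside class 0 but carries label 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
keep, combined = filter_label_noise(X, y)
print(keep)
```

On this toy data the last instance receives a strongly negative combined score and is filtered out, while the other eight are kept. A classifier would then be trained only on the instances listed in `keep`.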

Notes

  1. https://archive.ics.uci.edu/ml/index.php

  2. https://gitlab.com/machine_learning_vm/outliers

Acknowledgements

This work is part of the research project “Scalable resource efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.

Author information

Corresponding author

Correspondence to Veselka Boeva.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kohstall, J., Boeva, V., Lundberg, L., Angelova, M. (2020). Ensembles of Cluster Validation Indices for Label Noise Filtering. In: Jardim-Goncalves, R., Sgurev, V., Jotsov, V., Kacprzyk, J. (eds) Intelligent Systems: Theory, Research and Innovation in Applications. Studies in Computational Intelligence, vol 864. Springer, Cham. https://doi.org/10.1007/978-3-030-38704-4_4
