Abstract
Cluster validation measures are designed to find the partitioning that best fits the underlying data. In this study, we show that these measures can also be used to identify mislabeled instances, or class outliers, prior to training in supervised learning problems. We introduce an ensemble technique, called CVI-based Outlier Filtering, which identifies and eliminates mislabeled instances from the training set and then builds a classification hypothesis from the remaining instances. Our approach assigns to each instance in the training set several cluster validation scores, each representing the instance's potential of being a class outlier with respect to the clustering property that the corresponding validation measure assesses. In this respect, the proposed approach may be regarded as a multi-criteria outlier filtering measure. In this work, we specifically study and evaluate value-based ensembles of cluster validation indices. The added value of this approach in comparison to the logical and rank-based ensemble solutions is discussed and further demonstrated.
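The filtering idea described above can be illustrated with a minimal sketch. It assumes that class labels are treated as a clustering, that each instance receives one score per validation index, and that a value-based ensemble simply averages those scores. The two indices used here (a per-instance silhouette and a hypothetical centroid-margin score), the function names, and the threshold of 0 are illustrative assumptions, not the configuration evaluated in the chapter.

```python
import math
from statistics import mean


def silhouette_scores(X, y):
    """Per-instance silhouette, treating the class labels as the clustering."""
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        by_class = {}
        for j, (xj, yj) in enumerate(zip(X, y)):
            if j != i:
                by_class.setdefault(yj, []).append(math.dist(xi, xj))
        if yi not in by_class or len(by_class) < 2:
            scores.append(0.0)  # singleton class or single-class data: neutral
            continue
        a = mean(by_class[yi])  # cohesion with own class
        b = min(mean(d) for c, d in by_class.items() if c != yi)  # nearest other class
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def centroid_margin_scores(X, y):
    """Hypothetical second index: margin between own and nearest other centroid."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    centroids = {c: [mean(col) for col in zip(*pts)] for c, pts in groups.items()}
    scores = []
    for x, label in zip(X, y):
        own = math.dist(x, centroids[label])
        others = [math.dist(x, m) for c, m in centroids.items() if c != label]
        if not others:
            scores.append(0.0)
            continue
        total = own + min(others)
        scores.append((min(others) - own) / total if total > 0 else 0.0)
    return scores


def filter_label_noise(X, y, threshold=0.0):
    """Value-based ensemble (assumed here to be a plain mean of the scores)."""
    per_index = [silhouette_scores(X, y), centroid_margin_scores(X, y)]
    combined = [mean(vals) for vals in zip(*per_index)]
    keep = [i for i, s in enumerate(combined) if s >= threshold]
    return keep, combined


# Toy data: two well-separated groups; the last point sits in the first
# group but carries the label of the second, so it acts as a class outlier.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11), (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

keep, combined = filter_label_noise(X, y)
print(keep)  # indices of the instances retained for training
```

On this toy data the mislabeled instance receives a strongly negative combined score and is filtered out, while the other eight instances are kept. A classifier would then be trained on the retained indices only. Both indices here are scaled to the range [-1, 1], which is what makes a plain average meaningful; indices on different scales would need normalization first.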
Acknowledgements
This work is part of the research project “Scalable resource efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kohstall, J., Boeva, V., Lundberg, L., Angelova, M. (2020). Ensembles of Cluster Validation Indices for Label Noise Filtering. In: Jardim-Goncalves, R., Sgurev, V., Jotsov, V., Kacprzyk, J. (eds) Intelligent Systems: Theory, Research and Innovation in Applications. Studies in Computational Intelligence, vol 864. Springer, Cham. https://doi.org/10.1007/978-3-030-38704-4_4
DOI: https://doi.org/10.1007/978-3-030-38704-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38703-7
Online ISBN: 978-3-030-38704-4
eBook Packages: Intelligent Technologies and Robotics (R0)