Ensembles of Cluster Validation Indices for Label Noise Filtering

Kohstall, Jan; Boeva, Veselka; Lundberg, Lars; Angelova, Milena

doi:10.1007/978-3-030-38704-4_4

Jan Kohstall⁶,
Veselka Boeva⁷,
Lars Lundberg⁷ &
…
Milena Angelova⁸

Part of the book series: Studies in Computational Intelligence ((SCI,volume 864))

458 Accesses
1 Citations

Abstract

Cluster validation measures are designed to find the partitioning that best fits the underlying data. In this study, we show that these measures can be used for identifying mislabeled instances or class outliers prior to training in supervised learning problems. We introduce an ensemble technique, entitled CVI-based Outlier Filtering, which identifies and eliminates mislabeled instances from the training set, and then builds a classification hypothesis from the set of remaining instances. Our approach assigns to each instance in the training set several cluster validation scores representing its potential of being a class outlier with respect to the clustering properties the used validation measures assess. In this respect, the proposed approach may be referred to a multi-criteria outlier filtering measure. In this work, we specifically study and evaluate valued-based ensembles of cluster validation indices. The added value of this approach in comparison to the logical and rank-based ensemble solutions are discussed and further demonstrated.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

eBook: USD 16.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

References

C.C. Aggarwal, Outlier ensembles: Position paper. ACM SIGKDD Explor. Newsl. 14(2), 49–58 (2013)
Article Google Scholar
A.E. Bayá, P.M. Granitto, How many clusters: A validation index for arbitrary-shaped clusters. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 10(2), 401–414 (2013)
Article Google Scholar
J. Bezdek, N. Pal, Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 28(3), 301–315 (1998)
Article Google Scholar
V. Boeva, J. Kohstall, L. Lundberg, M. Angelova, Combining cluster validation indices for detecting label noise, in Archives of Data Science, Series A, p. submitted (2018)
Google Scholar
V. Boeva, L. Lundberg, M. Angelova, J. Kohstall, Cluster validation measures for label noise filtering, in 9th IEEE International Conference on Intelligent Systems (IS’18), pp. 109–116 (2018)
Google Scholar
L. Breiman, Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
MATH Google Scholar
M.M. Breunig, H.P. Kriegel, R.T. Ng, J. Sander, Lof: identifying density-based local outliers, in ACM Sigmod Record, vol. 29. (ACM, 2000), pp. 93–104
Google Scholar
C.E. Brodley, M.A. Friedl, Identifying mislabeled training data. J. Artif. Intell. Res. 11, 131–167 (1999)
Article Google Scholar
O. Chapelle, B. Scholkopf, A. Zien, Semi-supervised learning. IEEE Trans. Neural Netw. 20(3), 542–542 (2009)
Article Google Scholar
P. Davidsson, Coin classification using a novel technique for learning characteristic decision trees by controlling the degree of generalization, in 9th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (1997), pp. 403–412
Google Scholar
C. Dwork, R. Kumar, M. Naor, D. Sivakumar, Rank aggregation methods for the web, in Proceedings of the 10th International Conference on World Wide Web (ACM, 2001), pp. 613–622
Google Scholar
B. Frénay, M. Verleysen, Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst. 25(5), 845–869 (2014)
Article Google Scholar
D. Gamberger, N. Lavrac, S. D\(\check{z}\)eroski, Noise detection and elimination in data preprocessing: Experiments in medical domains. Appl. Artif. Intell. 14(2), 205–223 (2000)
Google Scholar
M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques. J. Intell. Inf. Syst. 17(2–3), 107–145 (2001)
Article Google Scholar
J. Handl, J. Knowles, D. Kell, Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15), 3201–3212 (2005)
Article Google Scholar
Z. He, S. Deng, X. Xu, Outlier detection integrating semantic knowledge, in International Conference on Web-Age Information Management. (Springer, 2002), pp. 126–131
Google Scholar
Z. He, X. Xu, J. Huang, S. Deng, Mining class outliers: Concepts, algorithms and applications in crm. Expert. Syst. Appl. 27(4), 681–697 (2004)
Article Google Scholar
N. Hewahi, M. Saad, Class outliers mining: Distance-based approach. Int. J. Intell. Syst. Technol. 2, 5 (2007)
Google Scholar
A. Jain, R. Dubes, Algorithms for Clustering Data (Prentice-Hall Inc, Upper Saddle River, NJ, USA, 1988)
MATH Google Scholar
P.A. Jaskowiak, D. Moulavi, C.A. Furtado, R.J. Campello, A. Zimek, J. Sander, On strategies for building effective ensembles of relative clustering validity criteria. Knowl. Inf. Syst. 47(2), 329–354 (2016)
Article Google Scholar
T.M. Khoshgoftaar, P. Rebours, Generating multiple noise elimination filters with the ensemble-partitioning filter, Information Reuse and Integration, 2004. IRI 2004. Proceedings of the 2004 IEEE International Conference on IEEE (2004), pp. 369–375
Google Scholar
T.M. Khoshgoftaar, N. Seliya, K. Gao, Rule-based noise detection for software measurement data, Information Reuse and Integration, 2004. IRI 2004. Proceedings of the 2004 IEEE International Conference on IEEE (2004), pp. 302–307
Google Scholar
R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in Ijcai, vol. 14. (Montreal, Canada 1995), pp. 1137–1145
Google Scholar
R. Kolde, S. Laur, P. Adler, J. Vilo, Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics 28(4), 573–580 (2012)
Article Google Scholar
H.P. Kriegel, P. Kroger, E. Schubert, A. Zimek, Interpreting and unifying outlier scores, in Proceedings of the 2011 SIAM International Conference on Data Mining. (SIAM, 2011), pp. 13–24
Google Scholar
J.M. Kubica, A. Moore, Probabilistic noise identification and data cleaning, in ICDM (2003), pp. 131–138
Google Scholar
B. Larsen, C. Aone, Fast and effective text mining using linear-time document clustering, in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (ACM, 1999), pp. 16–22
Google Scholar
N. Lavesson, P. Davidsson, A multi-dimensional measure function for classifier performance, Intelligent Systems, 2004, in Proceedings. 2004 2nd International IEEE Conference, vol. 2. (IEEE, 2004), pp. 508–513
Google Scholar
Y. Liu, Understanding and enhancement of internal clustering validation measures. IEEE Trans. Cybern. 43(3), 982–994 (2013)
Article Google Scholar
E. Müller, I. Assent, P. Iglesias, Y. Mulle, K. Bohm, Outlier ranking via subspace analysis in multiple views of the data, in Data Mining (ICDM), 2012 IEEE 12th International Conference on IEEE (2012), pp. 529–538
Google Scholar
E. Müller, I. Assent, U. Steinhausen, T. Seidl, Outrank: Ranking outliers in high dimensional data, in Data Engineering Workshop, 2008. ICDEW 2008. IEEE 24th International Conference on IEEE (2008), pp. 600–603
Google Scholar
H.V. Nguyen, H.H. Ang, V. Gopalkrishnan, Mining outliers with ensemble of heterogeneous detectors on random subspaces, in International Conference on Database Systems for Advanced Applications (Springer, 2010), pp. 368–383
Google Scholar
S. Papadimitriou, C. Faloutsos, Cross-outlier detection, in International Symposium on Spatial and Temporal Databases. (Springer, 2003), pp. 199–213
Google Scholar
P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Article Google Scholar
R.E. Schapire, The strength of weak learnability. Mach. Learn. 5(2), 197–227 (1990)
Google Scholar
E. Schubert, R. Wojdanowski, A. Zimek, H.P. Kriegel, On evaluation of outlier rankings and outlier scores, in Proceedings of the 2012 SIAM International Conference on Data Mining (SIAM, 2012), pp. 1047–1058
Google Scholar
N. Segata, E. Blanzieri, Fast and scalable local kernel machines. J. Mach. Learn. Res. 11, 1883–1926 (2010)
MathSciNet MATH Google Scholar
M. Smith, T. Martinez, Improving classification accuracy by identifying and removing instances that should be misclassified, in Neural Networks (IJCNN), The 2011 International Joint Conference on IEEE (2011), pp. 2690–2697
Google Scholar
M. Smith, T. Martinez, A comparative evaluation of curriculum learning with filtering and boosting in supervised classification problems. Comput. Intell. 32(2), 167–195 (2016)
Article MathSciNet Google Scholar
I. Tomek, An experiment with the edited nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. SMC–6(6), 448–452 (1976). https://doi.org/10.1109/TSMC.1976.4309523
Article MathSciNet MATH Google Scholar
E. Tsiporkova, V. Boeva, Nonparametric recursive aggregation process. Kybern. J. Czech Soc. Cybern. Inf. Sci. 40(1), 51–70 (2004)
MathSciNet MATH Google Scholar
L. Vendramin, R. Campello, E.R. Hruschka, Relative clustering validity criteria: A comparative overview. Stat. Anal. Data Min. ASA Data Sci. J. 3(4), 209–235 (2010)
MathSciNet Google Scholar
L. Vendramin, P. Jaskowiak, R. Campello, On the combination of relative clustering validity criteria, in Proceedings of the 25th International Conference on Scientific and Statistical Database Management (ACM, 2013), p. 4
Google Scholar
D. Xu, Y. Tian, A comprehensive survey of clustering algorithms. Ann. Data Sci. 2(2), 165–193 (2015)
Article MathSciNet Google Scholar
X. Zeng, T.R. Martinez, An algorithm for correcting mislabeled data. Intell. Data Anal. 5(6), 491–502 (2001)
Article Google Scholar
A. Zimek, R.J. Campello, J. Sander, Ensembles for unsupervised outlier detection: Ehallenges and research questions a position paper. Acm Sigkdd Explor. Newsl. 15(1), 11–22 (2014)
Google Scholar
A. Zimek, E. Schubert, H.P. Kriegel, A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. ASA Data Sci. J. 5(5), 363–387 (2012)
Article MathSciNet Google Scholar

Download references

Acknowledgements

This work is part of the research project “Scalable resource efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.

Author information

Authors and Affiliations

acs Plus GmbH, Berlin, Germany
Jan Kohstall
Blekinge Institute of Technology, Karlskrona, Sweden
Veselka Boeva & Lars Lundberg
Technical University of Sofia, Plovdiv, Bulgaria
Milena Angelova

Authors

Jan Kohstall
View author publications
You can also search for this author in PubMed Google Scholar
Veselka Boeva
View author publications
You can also search for this author in PubMed Google Scholar
Lars Lundberg
View author publications
You can also search for this author in PubMed Google Scholar
Milena Angelova
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Veselka Boeva .

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computers, Faculdade de Ciências e Tecnologia, UNINOVA-CTS Centre of Technology and Systems, Universidade Nova de Lisboa, Caparica, Portugal
Ricardo Jardim-Goncalves
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
Vassil Sgurev
University of Library Studies and Information Technologies, Sofia, Bulgaria
Vladimir Jotsov
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Janusz Kacprzyk

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Kohstall, J., Boeva, V., Lundberg, L., Angelova, M. (2020). Ensembles of Cluster Validation Indices for Label Noise Filtering. In: Jardim-Goncalves, R., Sgurev, V., Jotsov, V., Kacprzyk, J. (eds) Intelligent Systems: Theory, Research and Innovation in Applications. Studies in Computational Intelligence, vol 864. Springer, Cham. https://doi.org/10.1007/978-3-030-38704-4_4

Download citation

DOI: https://doi.org/10.1007/978-3-030-38704-4_4
Published: 04 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38703-7
Online ISBN: 978-3-030-38704-4
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)

Publish with us

Policies and ethics