
Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings


Abstract

In general practice, noise is almost inevitably perceived as harmful. In data analytics specifically, most existing techniques rely on a noise-free assumption, and without the assistance of data pre-processing it is difficult for such models to discover reliable patterns. This is also true for k-means, one of the best-known algorithms for cluster analysis. Several works in the literature suggest that an ensemble approach can deliver accurate results by combining multiple clusterings of data containing noise completely at random. Given this motivation, the paper presents a study of using different consensus clustering techniques to analyze noisy data, with k-means exploited to generate the base clusterings. The empirical investigation reveals that the ensemble approach can be robust to low levels of noise, with some ensembles even improving on the noise-free cases. This finding is in line with recently published work that underlines the benefit of small amounts of noise to centroid-based clustering methods. In addition, the outcome of this research provides a guideline for analyzing a new data collection of uncertain quality.
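The following Python sketch illustrates, under stated assumptions, one way the setup described above could be realized: attribute noise injected completely at random (NCAR) into a numeric dataset, a pool of k-means runs used as base clusterings, and a co-association (evidence-accumulation) consensus over them. The function names (`add_ncar_noise`, `kmeans_coassociation_consensus`) and parameters such as `noise_level` and `n_members` are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the paper's exact pipeline): inject attribute noise
# completely at random (NCAR), build an ensemble of k-means clusterings, and
# combine them through a co-association (evidence-accumulation) consensus.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import load_iris


def add_ncar_noise(X, noise_level=0.1, rng=None):
    """Replace a random fraction of attribute values with values drawn
    uniformly from each attribute's observed range (noise completely at random)."""
    rng = np.random.default_rng(rng)
    X_noisy = X.copy()
    mask = rng.random(X.shape) < noise_level          # which cells become noisy
    lo, hi = X.min(axis=0), X.max(axis=0)
    random_values = rng.uniform(lo, hi, size=X.shape)
    X_noisy[mask] = random_values[mask]
    return X_noisy


def kmeans_coassociation_consensus(X, k, n_members=20, seed=0):
    """Run k-means several times and derive a consensus partition from the
    co-association matrix (fraction of runs in which two points co-cluster)."""
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for m in range(n_members):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed + m).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_members
    # Treat 1 - coassociation as a distance and cluster it hierarchically.
    consensus = AgglomerativeClustering(
        n_clusters=k, metric="precomputed", linkage="average"
    ).fit_predict(1.0 - coassoc)
    return consensus


if __name__ == "__main__":
    X = load_iris().data
    X_noisy = add_ncar_noise(X, noise_level=0.1, rng=42)
    labels = kmeans_coassociation_consensus(X_noisy, k=3)
    print(np.bincount(labels))
```

Other consensus functions studied in this line of work (for example, graph-based or link-based ensembles) could be substituted for the hierarchical step above without changing the NCAR noise-injection part of the sketch.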




Acknowledgements

This work is funded by IAPP1-100077 (Newton RAE-TRF): Anomaly Traffic Identification through Artificial Intelligence, Cyber Security and Big Data Analytics Technologies. It is also partly supported by Mae Fah Luang University.

Author information

Correspondence to Natthakan Iam-On.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Iam-On, N. Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings. Int. J. Mach. Learn. & Cyber. 11, 491–509 (2020). https://doi.org/10.1007/s13042-019-00989-4


Keywords

  • Data clustering
  • Attribute noise
  • NCAR
  • Cluster ensemble
  • Robustness