Non-standard Distances in High Dimensional Raw Data Stream Classification

  • Kamil Ząbkiewicz
Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 869)

Abstract

In this paper, we present a new approach to classifying high-dimensional raw (or close to raw) data streams. It is based on the k-nearest neighbour (kNN) classifier. The novelty of the proposed solution lies in non-standard distances, which are computed using compression and hashing methods. We use the term “non-standard” to emphasize how the proposed distances are computed. Standard distances, such as Euclidean, Manhattan, and Mahalanobis, are calculated from numerical features that describe the data. The non-standard approach does not have to rely on extracted features: it can operate on raw (not preprocessed) data, so the proposed method requires neither feature selection nor feature extraction. Experiments were performed on datasets with more than 1000 features. The results show that in most cases the proposed method performs better than, or similarly to, standard stream classification algorithms. All experiments and comparisons were performed in the Massive Online Analysis (MOA) environment.
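
To make the idea of a compression-based distance inside a kNN stream classifier concrete, the following is a minimal sketch in Python. It is not the chapter's implementation: the abstract does not specify the exact distances, so the sketch assumes the normalized compression distance (NCD) with zlib as the compressor, and a fixed-size sliding window of recent labelled instances. The names `compressed_size`, `ncd`, and `CompressionKNN`, as well as the parameters `k` and `window_size`, are hypothetical and chosen only for illustration.

```python
import zlib
from collections import Counter, deque


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two raw byte strings.

    Small values suggest the inputs share structure the compressor can
    exploit; values near 1 suggest they share little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


class CompressionKNN:
    """kNN over a sliding window of raw instances (illustrative sketch)."""

    def __init__(self, k: int = 3, window_size: int = 100):
        self.k = k
        # Oldest instances are dropped automatically once the window is full.
        self.window = deque(maxlen=window_size)

    def predict(self, x: bytes):
        """Majority label among the k window instances closest to x."""
        if not self.window:
            return None  # nothing learned yet
        nearest = sorted(self.window, key=lambda item: ncd(x, item[0]))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    def learn(self, x: bytes, label) -> None:
        """Store a labelled raw instance; no features are extracted."""
        self.window.append((x, label))
```

In a stream setting, such a classifier would typically be evaluated in a test-then-train loop: call `predict` on each arriving instance first, compare the result with the true label, and only then call `learn` with that instance. Each prediction compresses the query against every instance in the window, so the window size directly bounds the cost per instance.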

Keywords

Stream classification, High-dimensional data, KNN classifier, Distance, MOA, Data compression, Hashing

Notes

Acknowledgements

We would like to thank the reviewers for their valuable comments and effort to improve this paper. Computations performed as part of the experiments were carried out at the Computer Center of the University of Bialystok.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Faculty of Economics and Informatics in Vilnius, University of Bialystok, Vilnius, Lithuania