Non-standard Distances in High Dimensional Raw Data Stream Classification

Ząbkiewicz, Kamil

doi:10.1007/978-3-030-39250-5_5

Chapter
First Online: 13 February 2020

Part of the book series: Studies in Computational Intelligence (SCI, volume 869)

Abstract

In this paper, we present a new approach for classifying high-dimensional raw (or close to raw) data streams. It is based on the k-nearest neighbour (kNN) classifier. The novelty of the proposed solution lies in its non-standard distances, which are computed using compression and hashing methods. We use the term “non-standard” to emphasize how the proposed distances are computed. Standard distances, such as the Euclidean, Manhattan, and Mahalanobis distances, are calculated from numerical features that describe the data. The non-standard approach does not have to rely on extracted features: it can work on raw (not preprocessed) data, so the proposed method requires no feature selection or extraction. Experiments were performed on datasets with more than 1000 features. The results show that in most cases the proposed method performs better than, or comparably to, standard stream classification algorithms. All experiments and comparisons were performed in the Massive Online Analysis (MOA) environment.
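
To make the idea of a compression-based distance inside a kNN stream classifier concrete, here is a minimal sketch. It is not the chapter's implementation: the abstract does not say which compressor, hash, or distance formula is used, so the sketch assumes the well-known normalized compression distance (NCD) with zlib as the compressor. The class name `WindowedCompressionKNN`, the helper `noisy_copy`, the sliding-window size, the value of k, and the synthetic two-class stream are all hypothetical choices made for illustration.

```python
import random
import zlib
from collections import Counter, deque


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compression level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance; smaller values mean more shared structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


class WindowedCompressionKNN:
    """kNN over a sliding window of recent labelled raw byte strings (assumed design)."""

    def __init__(self, k: int = 3, window_size: int = 50):
        self.k = k
        self.window = deque(maxlen=window_size)  # oldest items drop out automatically

    def predict(self, x: bytes):
        """Majority label among the k window items closest to x, or None if empty."""
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda item: ncd(x, item[0]))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    def learn(self, x: bytes, label) -> None:
        """Store the labelled item; no features are extracted from the raw bytes."""
        self.window.append((x, label))


def noisy_copy(template: bytes, rng: random.Random, flips: int = 10) -> bytes:
    """Synthetic raw sample: the template with a few bytes overwritten at random."""
    sample = bytearray(template)
    for _ in range(flips):
        sample[rng.randrange(len(sample))] = rng.randrange(256)
    return bytes(sample)


# Toy stream (synthetic): each class is a noisy copy of its own random 500-byte template.
rng = random.Random(0)
templates = {
    label: bytes(rng.randrange(256) for _ in range(500)) for label in ("a", "b")
}
stream = [(noisy_copy(templates[label], rng), label) for label in "abab" * 5]

# Test-then-train loop: predict each item first, then reveal its label.
model = WindowedCompressionKNN(k=3, window_size=10)
correct = 0
for x, label in stream:
    correct += int(model.predict(x) == label)
    model.learn(x, label)
print(f"accuracy: {correct}/{len(stream)}")
```

The distance is computed directly on the byte strings, so no numerical feature vector is ever built. The cost is several compressor calls per comparison, which is why the sketch bounds the number of stored items with a window.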

Acknowledgements

We would like to thank the reviewers for their valuable comments and effort to improve this paper. Computations performed as part of the experiments were carried out at the Computer Center of the University of Bialystok.

Author information

Authors and Affiliations

Faculty of Economics and Informatics in Vilnius, University of Bialystok, Kalvarijų g. 135, 08221, Vilnius, Lithuania
Kamil Ząbkiewicz

Corresponding author

Correspondence to Kamil Ząbkiewicz.

Editor information

Editors and Affiliations

Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania
Gintautas Dzemyda
Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, Lithuania
Jolita Bernatavičienė
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Janusz Kacprzyk

About this chapter

Cite this chapter

Ząbkiewicz, K. (2020). Non-standard Distances in High Dimensional Raw Data Stream Classification. In: Dzemyda, G., Bernatavičienė, J., Kacprzyk, J. (eds) Data Science: New Issues, Challenges and Applications. Studies in Computational Intelligence, vol 869. Springer, Cham. https://doi.org/10.1007/978-3-030-39250-5_5

DOI: https://doi.org/10.1007/978-3-030-39250-5_5
Published: 13 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39249-9
Online ISBN: 978-3-030-39250-5
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
