A fast unsupervised preprocessing method for network monitoring

Annals of Telecommunications

Abstract

Identifying network misuse takes days or even weeks, and network administrators usually neglect zero-day threats until a large number of malicious users exploit them. Moreover, security applications such as anomaly detection and attack mitigation systems must monitor traffic in real time to reduce the impact of security incidents. Information processing time should therefore be as short as possible to enable an effective defense against attacks. In this paper, we present a fast preprocessing method for network traffic classification based on feature correlation and feature normalization. The proposed method couples a normalization algorithm with a feature selection algorithm. We evaluate the proposed algorithms on three different datasets with eight different machine learning classification algorithms. Our normalization algorithm reduces the classification error rate when compared with traditional methods. Our feature selection algorithm chooses an optimized subset of features, improving accuracy by more than 11% with a 100-fold reduction in processing time when compared to traditional feature selection and feature reduction algorithms. The preprocessing method works on both batch and streaming data and is able to detect concept drift.
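
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea of coupling normalization with unsupervised, correlation-based feature selection. It is not the authors' method: min-max scaling, Pearson correlation, the greedy filter, the 0.9 threshold, and all function names are assumptions chosen for illustration.

```python
from math import sqrt
from typing import List, Sequence


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Scale one feature column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either column has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_features(columns: List[List[float]], threshold: float = 0.9) -> List[int]:
    """Greedy unsupervised filter (illustrative assumption, no class labels used):
    keep a feature only if its absolute correlation with every already-kept
    feature is below the threshold."""
    kept: List[int] = []
    for i, col in enumerate(columns):
        if all(abs(pearson(col, columns[j])) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy data, one list per feature: feature 1 is exactly 2x feature 0 (redundant),
# feature 2 is only weakly related to the others.
raw = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [5.0, 1.0, 4.0, 2.0],
]
normalized = [min_max_normalize(c) for c in raw]
selected = select_features(normalized, threshold=0.9)
print(selected)  # indices of the features that survive the filter
```

On this toy input the redundant second feature is discarded and the other two are kept. A real deployment would also need a streaming variant that updates the minima, maxima, and correlations incrementally, which this sketch does not attempt.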

Notes

  1. Features refer to the original set of attributes that describe the data. Variables refer to the input of the machine learning algorithms applied over the data. If no preprocessing method handles the original data, the set of variables and the set of features are the same.

  2. Anonymized data can be requested by emailing the authors.

Acknowledgments

The authors would like to thank Antonio Lobato, Igor Alvarenga, and Igor Sanz for their significant contributions to obtaining the results.

Funding

This research is supported by CNPq, CAPES, FAPERJ, and FAPESP (2015/24514-9, 2015/24485-9, and 2014/50937-1).

Author information

Corresponding author

Correspondence to Martin Andreoni Lopez.

About this article

Cite this article

Andreoni Lopez, M., Mattos, D.M.F., Duarte, O.C.M.B. et al. A fast unsupervised preprocessing method for network monitoring. Ann. Telecommun. 74, 139–155 (2019). https://doi.org/10.1007/s12243-018-0663-2

  • DOI: https://doi.org/10.1007/s12243-018-0663-2
