Entropy-based outlier detection using Spark

  • Guilan Feng
  • Zhengnan Li
  • Wengang Zhou
  • Shi Dong
Article

Abstract

The k-nearest neighbors (kNN) outlier detection method is simple, effective, and widely used in data mining. Applying it directly to big data is infeasible because of time and memory constraints. Several distributed alternatives based on MapReduce have been proposed to let the method handle large-scale data, but their performance can be improved further with designs suited to newer technologies. Furthermore, kNN outlier detection gives every attribute the same importance when deciding whether a point is an outlier. Several approaches improve its precision, and entropy-based outlier detection is among the most successful. Entropy-based outlier detection computes the entropy of each attribute of the data set and uses it to weight the distance formula used for outlier detection. Although kNN outlier detection has been adapted to big datasets, no entropy-based outlier detection method exists that can manage that volume of data. In this paper, we propose an entropy-based outlier detection method built on Spark. It consists of three separate stages. The first stage computes the attribute entropy. The second stage finds the k nearest neighbors of each point and calculates its degree of outlierness using the attribute entropy computed in the first stage. The third stage ranks the points by degree of outlierness and declares the top n points in the ranking to be outliers. Extensive experimental results show the advantages of the proposed method: it improves outlier detection precision, reduces runtime, and detects outliers effectively in large-scale datasets.
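
The three stages described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation, and several details are assumptions made only for illustration because the abstract does not specify them: attributes are numeric and discretized into equal-width bins before computing Shannon entropy, the weights are the entropies normalized to sum to 1, the distance is a weighted Euclidean distance, and the outlier degree is the mean distance to the k nearest neighbors. The function names (`attribute_weights`, `outlier_degrees`, `top_n_outliers`) are hypothetical.

```python
import math
from collections import Counter


def attribute_weights(rows, bins=4):
    """Stage 1: per-attribute Shannon entropy, normalized into weights.

    Assumption: numeric attributes are discretized into equal-width bins.
    """
    n = len(rows)
    entropies = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0  # constant column falls into one bin
        counts = Counter(min(int((v - lo) / width), bins - 1) for v in col)
        entropies.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    total = sum(entropies)
    if total <= 0:  # every attribute is constant: fall back to equal weights
        return [1.0 / len(entropies)] * len(entropies)
    return [e / total for e in entropies]


def outlier_degrees(rows, weights, k):
    """Stage 2: mean entropy-weighted distance to the k nearest neighbors.

    Assumption: weighted Euclidean distance; brute-force neighbor search.
    """
    degrees = []
    for i, p in enumerate(rows):
        dists = sorted(
            math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))
            for j, q in enumerate(rows)
            if j != i
        )
        degrees.append(sum(dists[:k]) / k)
    return degrees


def top_n_outliers(rows, k, n):
    """Stage 3: rank points by outlier degree and return the top n.

    Returns a list of (row index, outlier degree) pairs, highest degree first.
    """
    weights = attribute_weights(rows)
    degrees = outlier_degrees(rows, weights, k)
    ranked = sorted(enumerate(degrees), key=lambda t: t[1], reverse=True)
    return ranked[:n]


# Five clustered points and one isolated point (made-up illustrative data).
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (1.0, 0.8), (8.0, 9.0)]
print(top_n_outliers(points, k=2, n=1))
```

On this toy data the isolated point at index 5 ranks first. In a distributed setting, the entropy computation and the neighbor search would presumably be partitioned across workers; the sketch keeps everything in memory to show only the data flow between the stages.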

Keywords

Outlier detection · Information entropy · k nearest neighbors · Spark

Notes

Acknowledgements

This work was supported by the Civil Aviation Flight Data Analysis under No. XM2852 and the Key Scientific and Technological Research Projects in Henan Province (Grant No. 192102210125).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Guilan Feng (1)
  • Zhengnan Li (2)
  • Wengang Zhou (3)
  • Shi Dong (4), email author
  1. Modern Education Technology Center, Civil Aviation Flight University of China, Guanghan, China
  2. Institute of Aviation Engineering, Civil Aviation Flight University of China, Guanghan, China
  3. Institute of Flight Technology, Civil Aviation Flight University of China, Guanghan, China
  4. School of Computer Science and Technology, Zhoukou Normal University, Zhoukou, China
