Categorical Features Transformation with Compact One-Hot Encoder for Fraud Detection in Distributed Environment

  • Ikram Ul Haq
  • Iqbal Gondal
  • Peter Vamplew
  • Simon Brown
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 996)


Fraud detection for online banking is an important research area, but one of its challenges is the heterogeneous nature of transaction data, i.e. a combination of numeric and categorical attributes. Classification, regression and clustering algorithms usually perform better on numeric data. However, many machine learning problems involve categorical (nominal) features rather than numeric features only. In addition, some machine learning platforms, such as Apache Spark, accept numeric data only. One-hot Encoding (OHE) is a widely used approach for transforming categorical features into numeric features in traditional data mining tasks. The one-hot approach has challenges of its own: the transformed data is sparse, and the distinct values of an attribute are not always known in advance. Beyond model accuracy, the compactness of machine learning models is equally important because of growing memory and storage needs. This paper presents an innovative technique that transforms categorical features into numeric features and compacts the resulting sparse data, even when not all distinct values are known. The transformed data can be used to develop fraud detection systems. The accuracy of the results has been validated on a multi-node data cluster using synthetic and real bank fraud data and a publicly available anomaly detection dataset (KDD-99).
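
To make the sparseness and unknown-value issues concrete, the following is a minimal sketch of plain one-hot encoding stored in sparse form. It is not the compact encoder proposed in the paper. The `channel` feature, the function names and the reserved `UNKNOWN_INDEX` slot are assumptions made only for illustration. The `(size, indices, values)` triple follows the usual sparse-vector layout, in which only non-zero entries are stored.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical convention: slot 0 is reserved for values not seen while fitting.
UNKNOWN_INDEX = 0

SparseVector = Tuple[int, List[int], List[float]]


def fit_vocabulary(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Assign each distinct category an index, starting after the unknown slot."""
    vocab: Dict[Hashable, int] = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


def encode_sparse(value: Hashable, vocab: Dict[Hashable, int]) -> SparseVector:
    """One-hot encode a value, storing only the single non-zero position."""
    size = len(vocab) + 1  # one slot per known category plus the unknown slot
    index = vocab.get(value, UNKNOWN_INDEX)
    return size, [index], [1.0]


def to_dense(vector: SparseVector) -> List[float]:
    """Expand a sparse vector to its dense form (mostly zeros)."""
    size, indices, values = vector
    dense = [0.0] * size
    for i, x in zip(indices, values):
        dense[i] = x
    return dense


# Illustrative data: a made-up "channel" attribute of a transaction.
channels = ["web", "mobile", "atm", "web"]
vocab = fit_vocabulary(channels)

encoded_known = encode_sparse("atm", vocab)      # category seen during fitting
encoded_unseen = encode_sparse("branch", vocab)  # category not seen before

print(to_dense(encoded_known))
print(to_dense(encoded_unseen))
```

The dense form grows by one column per distinct category while holding a single non-zero entry per row, which is the sparseness the abstract refers to. Routing unseen values to a reserved slot is one common workaround and is shown here only as an assumption, not as the paper's approach.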


One-hot Encoder, Compactness, Categorical data, Distributed computing, Hadoop, HDFS, Spark, Machine learning, Sparse data



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Ikram Ul Haq (1)
  • Iqbal Gondal (1)
  • Peter Vamplew (1)
  • Simon Brown (2)
  1. ICSL, School of Science, Engineering and Information Technology, Ballarat, Australia
  2. Westpac Bank, Melbourne, Australia
