Optimization of the Numeric and Categorical Attribute Weights in KAMILA Mixed Data Clustering Algorithm

  • Nádia Junqueira Martarelli
  • Marcelo Seido Nagano
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11871)

Abstract

Clustering algorithms for mixed data have been emerging slowly since the end of the last century. One of the most recent algorithms proposed for this data type is KAMILA (KAy-means for MIxed LArge data). Although KAMILA has outperformed earlier mixed data algorithms, it has some gaps. Among them is the definition of the numerical and categorical variable weights, which are either set by the user or, by default, equal to one for all features. Hence, we propose an optimization algorithm called Biased Random-Key Genetic Algorithm for Features Weighting (BRKGAFW) to weight the numerical and categorical variables in KAMILA. The experiment used six real-world mixed data sets and compared the proposal against two baseline algorithms: KAMILA with the default weights, and KAMILA with weights defined by a traditional genetic algorithm. The results show that the proposed algorithm outperformed both baselines on all data sets.
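To make the idea of weighting features with a biased random-key genetic algorithm more concrete, the following is a minimal sketch of a generic BRKGA loop, not the BRKGAFW implementation from the paper. It assumes that each random key in [0, 1) is used directly as one feature weight. The names `brkga` and `toy_fitness`, the parameter values, and the target vector are all hypothetical. In a real setting the fitness would run KAMILA with the candidate weights and return a clustering quality score. Here a toy function stands in for that step.

```python
import random

def brkga(fitness, n_keys, pop_size=30, elite_frac=0.2, mutant_frac=0.1,
          elite_bias=0.7, generations=50, seed=0):
    """Generic BRKGA that maximizes `fitness` over vectors in [0, 1)^n_keys."""
    rng = random.Random(seed)
    n_elite = max(1, int(elite_frac * pop_size))
    n_mutant = max(1, int(mutant_frac * pop_size))

    def random_keys():
        return [rng.random() for _ in range(n_keys)]

    population = [random_keys() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank individuals from best to worst.
        population.sort(key=fitness, reverse=True)
        elite, rest = population[:n_elite], population[n_elite:]
        # Mutants are fresh random vectors, not perturbed individuals.
        mutants = [random_keys() for _ in range(n_mutant)]
        children = []
        while n_elite + n_mutant + len(children) < pop_size:
            elite_parent = rng.choice(elite)
            other_parent = rng.choice(rest)
            # Biased uniform crossover: each key comes from the elite
            # parent with probability `elite_bias` (> 0.5).
            children.append([
                e if rng.random() < elite_bias else o
                for e, o in zip(elite_parent, other_parent)
            ])
        # The elite set is copied unchanged into the next generation.
        population = elite + mutants + children
    return max(population, key=fitness)

# Hypothetical stand-in for "cluster with these weights and score the result".
TARGET = [0.9, 0.1, 0.5]

def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))

best = brkga(toy_fitness, n_keys=3)
print([round(w, 2) for w in best])
```

Two features distinguish this scheme from a traditional genetic algorithm: one parent is always drawn from the elite set, and crossover is biased toward that parent's keys.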

Keywords

Attributes weighting · Mixed data clustering · Biased Random-Key Genetic Algorithm · KAMILA algorithm

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Nádia Junqueira Martarelli (1)
  • Marcelo Seido Nagano (1)
  1. Laboratory of Applied Operational Research, University of São Paulo, São Carlos, Brazil