Free alignment classification of dikarya fungi using some machine learning methods

  • Abbas Rohani
  • Mojtaba Mamarabadi
Original Article


Clustering genes by amino acid sequence similarity is one of the most important and persistently challenging problems in molecular biology. The most conventional methods are alignment-based. These methods cannot reliably identify and classify sequences, especially when the sequences are long and of unequal length. Therefore, to classify fungal hexosaminidase amino acid sequences and assign them to the correct taxonomic group, we evaluated the feasibility of computational alignment-free methods based on machine learning classifiers: support vector machine (SVM), k-nearest neighbor (KNN), self-organizing map (SOM) and an ensemble technique. The classifiers appropriately categorized large data sets of Dikarya hexosaminidase amino acid sequences according to their taxonomic groups in two phyla, the “Ascomycota” and the “Basidiomycota”. Two statistical methods, the paired t test and principal component analysis (PCA), were used for feature selection and for reducing the dimensionality of the features, respectively. Seven classifier performance metrics, a randomized complete block design, pairwise Tukey’s honestly significant difference tests and the technique for order preference by similarity to ideal solution (TOPSIS), together with modified k-fold cross-validation, were used to evaluate and rank the classifiers. The effect of training data size on classifier performance was also investigated. The results showed that both the rank and the performance of the classifiers depended on the training data size. For training data sizes of 80, 60, 40 and 20%, the highest average overall accuracies were 96.96, 95.81, 94.47 and 92.47%, obtained with the KNN, KNN, ensemble and ensemble classifiers, respectively.
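
As an illustration of the ranking step named in the abstract, the following is a minimal sketch of the technique for order preference by similarity to ideal solution applied to a table of classifier metrics. It is not the authors' implementation. The function name `topsis`, the equal criterion weights, the vector normalisation, the treatment of every metric as "higher is better", and the placeholder names `clf_a`, `clf_b`, `clf_c` with their made-up scores are all assumptions for the example; none of the numbers come from the study.

```python
import math


def topsis(scores, weights=None):
    """Return one closeness coefficient per alternative (row of `scores`).

    Every column is treated as a benefit criterion (higher is better).
    Coefficients lie in [0, 1]; a larger value means a better alternative.
    """
    n_crit = len(scores[0])
    if weights is None:
        # Assumption: all criteria are equally important.
        weights = [1.0 / n_crit] * n_crit

    # Step 1: vector-normalise each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in scores)) for j in range(n_crit)]
    weighted = [
        [weights[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(n_crit)]
        for row in scores
    ]

    # Step 2: ideal point (column maxima) and anti-ideal point (column minima).
    ideal = [max(row[j] for row in weighted) for j in range(n_crit)]
    anti = [min(row[j] for row in weighted) for j in range(n_crit)]

    def distance(a, b):
        # Euclidean distance between two equally long vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Step 3: relative closeness to the ideal point.
    result = []
    for row in weighted:
        d_plus = distance(row, ideal)
        d_minus = distance(row, anti)
        total = d_plus + d_minus
        # If all alternatives are identical, no ranking is possible.
        result.append(d_minus / total if total else 0.5)
    return result


# Made-up metric values (three hypothetical criteria per classifier).
scores = {
    "clf_a": [0.95, 0.90, 0.88],
    "clf_b": [0.90, 0.85, 0.80],
    "clf_c": [0.92, 0.93, 0.85],
}
names = list(scores)
closeness = topsis([scores[name] for name in names])
# Sort classifier names from best (highest closeness) to worst.
ranking = [name for _, name in sorted(zip(closeness, names), reverse=True)]
print(ranking)
```

In this sketch an alternative that is worst on every criterion receives a closeness of 0, and one that is best on every criterion receives 1, so the coefficient can be read as a single score that combines several performance metrics.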


Fungal hexosaminidase; Dikarya; Classification; Classifier



Artificial neural network


Analysis of variance


Adaptive rule-based


Area under an ROC curve


Deoxyribonucleic acid


Ensemble classifier


Fungal hexosaminidases


Number of positive samples


Number of negative samples


Honestly significant difference


K-nearest neighbor


Matthews correlation coefficient


Multi-criteria decision-making


Multilayer perceptron


Naïve Bayes


Principal component


Principal component analysis


Probabilistic neural network


Polynomial degree 2


Polynomial degree 3


Particle swarm optimization


Radial basis function


Randomized complete block design


Random forest


Self-organizing feature map


Self-organized map


Total sum of squares


Within-groups sum of squares


Support vector machine


Training data size


Two-layer classification framework


Negative samples


Technique for order preference by similarity to ideal solution


Number of positive samples


Youden’s index



Financial support from the Vice President for Research and Technology of Ferdowsi University of Mashhad is highly appreciated.



Copyright information

© The Natural Computing Applications Forum 2018

Authors and Affiliations

  1. Department of Biosystems Engineering, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
  2. Department of Plant Protection, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
