A Revamp Approach for Training of HMM to Accelerate Classification of 16S rRNA Gene Sequences

  • Prakash Choudhary
  • M. P. Kurhekar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10990)


Bioinformatics is growing rapidly, and biological databases are accumulating data faster than the data can be analyzed. Automatic classification of this information is therefore one of the critical problems in the field: the enormous volume of new data must be organized and managed so that it remains accessible and useful. The core difficulty in classifying biological information is annotating biological sequences with functional features. Annotating the large and rapidly increasing amount of genomic sequence data requires computational tools that classify genes in DNA sequences. This paper presents a computational method for classifying highly conserved 16S rRNA gene sequences using Hidden Markov Models (HMMs). We describe the algorithms used to implement the three phases of an HMM (training, decoding, and evaluation) in order to classify sequences into clusters with known, similar functional properties. In implementing the training phase, we address practical issues such as the selection of initial HMM parameters and computational weakness on large data sets. The method achieves a classification accuracy of 91% for Bacillus and 97% for Clostridia.
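As a rough illustration of the evaluation phase and of why long sequences are numerically delicate, the following minimal sketch scores a DNA sequence against several HMMs with the forward algorithm in log space and picks the best-scoring class. It is not the implementation described in the paper: the function names, the two class labels, and all probability values are hypothetical, and the log-space formulation is one common remedy for underflow, assumed here for illustration only.

```python
import math

ALPHABET = "ACGT"  # assumed observation alphabet for DNA sequences


def _log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of log-probabilities."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(seq, start, trans, emit):
    """Forward algorithm in log space: returns log P(seq | model).

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability that state s emits symbol ALPHABET[o]
    """
    n = len(start)
    first = ALPHABET.index(seq[0])
    alpha = [_log(start[s]) + _log(emit[s][first]) for s in range(n)]
    for ch in seq[1:]:
        o = ALPHABET.index(ch)
        alpha = [
            logsumexp([alpha[r] + _log(trans[r][s]) for r in range(n)])
            + _log(emit[s][o])
            for s in range(n)
        ]
    return logsumexp(alpha)


def classify(seq, models):
    """Return the name of the model under which seq is most likely."""
    return max(models, key=lambda name: log_likelihood(seq, *models[name]))


# Hypothetical two-state models with made-up parameters (not from the paper):
# class_A favours G/C emissions, class_B favours A/T emissions.
models = {
    "class_A": (
        [0.5, 0.5],
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.10, 0.40, 0.40, 0.10], [0.15, 0.35, 0.35, 0.15]],
    ),
    "class_B": (
        [0.5, 0.5],
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.40, 0.10, 0.10, 0.40], [0.35, 0.15, 0.15, 0.35]],
    ),
}

if __name__ == "__main__":
    print(classify("GCGCGGCC", models))
    print(classify("ATATTAAT", models))
```

Working with log-probabilities keeps the score finite even for sequences thousands of bases long, where a direct product of probabilities would underflow to zero. In a real classifier the per-class parameters would come from the training phase rather than being written by hand.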


Keywords: HMM, Parameters estimation, Biological sequence, 16S rRNA, Gene classification, Bioinformatics



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, National Institute of Technology Manipur, Imphal, India
  2. Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology Nagpur, Nagpur, India
