A Clustering Approach to Identify Candidates to Housekeeping Genes Based on RNA-seq Data

  • Conference paper
Advances in Bioinformatics and Computational Biology (BSB 2019)

Abstract

Housekeeping genes (HKGs) are essential for gene expression studies performed through reverse transcription quantitative polymerase chain reaction (RT-qPCR). These genes are related to the basic cellular processes that are essential for cell maintenance, survival and function. Thus, HKGs should be expressed in all cells of an organism regardless of tissue type, cell state or cell condition. High-throughput technologies, including RNA sequencing (RNA-seq), are used to study and identify these genes. RNA-seq allows the measurement of gene expression profiles in a target tissue or an isolated cell. Moreover, machine learning methods are routinely applied in genomics-related areas to enable the interpretation of large datasets, including gene expression datasets. This study reports a new machine learning-based approach to identify candidate HKGs in silico from RNA-seq gene expression data. The approach enabled the identification of stable HKG candidates in RNA-seq data from Corynebacterium pseudotuberculosis. These genes showed stable expression under different stress conditions, as well as a low variation index and low fold changes. Furthermore, some of these genes have already been reported in the literature as HKGs or HKG candidates for the same or other bacterial organisms, which supports the accuracy of the proposed method. We present a novel approach based on the K-means algorithm, internal metrics and machine learning methods that can identify stable housekeeping genes from gene expression data with high accuracy and efficiency.
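
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes each gene is summarised by two stability features (coefficient of variation and maximum log2 fold change across conditions), clusters genes with a plain K-means, and takes the cluster with the lowest mean variation as the HKG candidates. The gene names, expression values, feature choice, fixed k = 2 and the deterministic initialisation are all hypothetical; in practice the number of clusters would be chosen with internal validation indices.

```python
import math
import statistics

def stability_features(expr):
    """Return (coefficient of variation, log2 of max/min) for one gene."""
    cv = statistics.pstdev(expr) / statistics.fmean(expr)
    log_fc = math.log2(max(expr) / min(expr))
    return (cv, log_fc)

def kmeans(points, k, iters=50):
    """Lloyd's K-means with a simple deterministic initialisation (k >= 2)."""
    ordered = sorted(points)
    step = (len(ordered) - 1) / (k - 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(statistics.fmean(dim) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical normalised expression values: gene -> one value per condition.
expression = {
    "g1": [100, 102, 98, 101],
    "g2": [50, 51, 49, 50],
    "g3": [200, 198, 203, 201],
    "g4": [10, 80, 5, 40],
    "g5": [300, 30, 150, 600],
    "g6": [20, 5, 90, 45],
}

genes = list(expression)
points = [stability_features(expression[g]) for g in genes]
labels, centroids = kmeans(points, k=2)

# The cluster whose centroid has the lowest CV is treated as the stable one.
stable = min(range(len(centroids)), key=lambda c: centroids[c][0])
candidates = sorted(g for g, lab in zip(genes, labels) if lab == stable)
print(candidates)
```

On this toy input the three genes with nearly constant expression end up in the low-variation cluster. Real RNA-seq data would need normalisation and handling of zero counts before the fold change is computed.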



Author information

Corresponding author

Correspondence to Rommel T. J. Ramos.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Franco, E.F. et al. (2020). A Clustering Approach to Identify Candidates to Housekeeping Genes Based on RNA-seq Data. In: Kowada, L., de Oliveira, D. (eds) Advances in Bioinformatics and Computational Biology. BSB 2019. Lecture Notes in Computer Science, vol 11347. Springer, Cham. https://doi.org/10.1007/978-3-030-46417-2_8

  • DOI: https://doi.org/10.1007/978-3-030-46417-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46416-5

  • Online ISBN: 978-3-030-46417-2

  • eBook Packages: Computer Science, Computer Science (R0)
