Abstract
Housekeeping genes (HKGs) are essential for gene expression studies performed by reverse transcription quantitative polymerase chain reaction (RT-qPCR). These genes are involved in the basic cellular processes required for cell maintenance, survival and function. Thus, HKGs should be expressed in all cells of an organism regardless of tissue type, cell state or cell condition. High-throughput technologies, including RNA sequencing (RNA-seq), are used to study and identify these genes. RNA-seq measures gene expression profiles in a target tissue or an isolated cell. Moreover, machine learning methods are routinely applied in genomics to enable the interpretation of large datasets, including gene expression data. This study reports a new machine learning based approach to identify candidate HKGs in silico from RNA-seq gene expression data. The approach enabled the identification of stable HKG candidates in RNA-seq data from Corynebacterium pseudotuberculosis. These genes showed stable expression under different stress conditions, as well as low variation indices and low fold changes. Furthermore, some of these genes were already reported in the literature as HKGs or HKG candidates for the same or other bacterial organisms, which supports the accuracy of the proposed method. We present a novel approach based on the K-means algorithm, internal metrics and machine learning methods that can identify stable housekeeping genes from gene expression data with high accuracy and efficiency.
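To make the general idea concrete, the following is a minimal sketch of how K-means clustering and an internal metric might be combined to shortlist stably expressed genes. It is not the authors' pipeline: the abstract does not state which features, number of clusters or metrics were used, so the two features (coefficient of variation and maximum absolute log2 fold change), the choice of k = 2, the silhouette score, the toy expression values and all function names are assumptions made purely for illustration.

```python
import math
from statistics import mean, pstdev


def stability_features(expr):
    """Map each gene to (coefficient of variation, max |log2 fold change|).

    expr: dict of gene -> expression values, one per condition (hypothetical layout).
    Fold changes are taken against the first condition, with a pseudocount of 1.
    """
    feats = {}
    for gene, values in expr.items():
        mu = mean(values)
        if mu <= 0:
            continue  # skip genes with no detectable expression
        cv = pstdev(values) / mu
        ref = values[0]
        max_lfc = max(abs(math.log2((v + 1) / (ref + 1))) for v in values[1:])
        feats[gene] = (cv, max_lfc)
    return feats


def kmeans(points, k=2, iters=100):
    """Plain Lloyd's K-means with a deterministic, evenly spaced initialisation (k >= 2)."""
    order = sorted(points)
    centroids = [order[i * (len(order) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(mean(dim) for dim in zip(*members)) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette width, used here as an example of an internal metric."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        others = [mean(d) for lab, d in dists.items() if lab != labels[i]]
        if not own or not others:
            scores.append(0.0)  # singleton cluster or single-cluster solution
            continue
        a, b = mean(own), min(others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def candidate_hkgs(expr, k=2):
    """Return genes in the cluster with the lowest mean CV, plus the silhouette score."""
    feats = stability_features(expr)
    genes = list(feats)
    points = [feats[g] for g in genes]
    labels, centroids = kmeans(points, k)
    stable = min(range(k), key=lambda c: centroids[c][0])
    return [g for g, lab in zip(genes, labels) if lab == stable], silhouette(points, labels)


# Toy data (invented): four conditions per gene.
toy = {
    "geneA": [100, 102, 98, 101],
    "geneB": [50, 51, 49, 50],
    "geneC": [10, 80, 5, 200],
    "geneD": [300, 20, 150, 900],
    "geneE": [75, 74, 76, 75],
}
candidates, score = candidate_hkgs(toy)
print(candidates, round(score, 3))
```

In this sketch the features are clustered on their raw scales; with real RNA-seq data one would likely normalise counts and standardise the features first, and compare several values of k using the internal metric before choosing a partition.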
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Franco, E.F. et al. (2020). A Clustering Approach to Identify Candidates to Housekeeping Genes Based on RNA-seq Data. In: Kowada, L., de Oliveira, D. (eds) Advances in Bioinformatics and Computational Biology. BSB 2019. Lecture Notes in Computer Science, vol 11347. Springer, Cham. https://doi.org/10.1007/978-3-030-46417-2_8
DOI: https://doi.org/10.1007/978-3-030-46417-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46416-5
Online ISBN: 978-3-030-46417-2
eBook Packages: Computer Science, Computer Science (R0)