A Clustering Approach to Identify Candidates to Housekeeping Genes Based on RNA-seq Data

Franco, Edian F.; Maués, Dener; Alves, Ronnie; Guimarães, Luis; Azevedo, Vasco; Silva, Artur; Ghosh, Preetam; Morais, Jefferson; Ramos, Rommel T. J.

doi:10.1007/978-3-030-46417-2_8

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11347)

Included in the following conference series:

Brazilian Symposium on Bioinformatics

Abstract

Housekeeping genes (HKGs) are essential for gene expression studies performed through reverse transcription quantitative polymerase chain reaction (RT-qPCR), where they serve as reference genes. These genes are involved in the basic cellular processes that are essential for cell maintenance, survival, and function. Thus, HKGs should be expressed in all cells of an organism regardless of tissue type, cell state, or cell condition. High-throughput technologies, including RNA sequencing (RNA-seq), are used to study and identify such genes. RNA-seq allows the measurement of gene expression profiles in a target tissue or an isolated cell. Moreover, machine learning methods are routinely applied in genomics to enable the interpretation of large datasets, including gene expression data. This study reports a new machine learning based approach to identify candidate HKGs in silico from RNA-seq gene expression data. The approach identified stable HKG candidates in RNA-seq data from Corynebacterium pseudotuberculosis. These genes showed stable expression under different stress conditions, as well as a low variation index and low fold changes. Furthermore, some of these genes have already been reported in the literature as HKGs or HKG candidates for the same or other bacterial organisms, which supports the accuracy of the proposed method. We present a novel approach based on the K-means algorithm, internal cluster validation metrics, and machine learning methods that can identify stable housekeeping genes from gene expression data with high accuracy and efficiency.
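
The general idea of grouping genes by expression stability can be illustrated with a minimal sketch. It is not the authors' pipeline, which the abstract does not detail. It assumes each gene is summarized by two hypothetical stability features (coefficient of variation and maximum absolute log2 fold change across conditions), clusters those features with a plain K-means, and reports the cluster whose centroid is closest to the origin as the candidate set. The function names, the pseudocount, the choice of features, and the toy expression values are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def stability_features(expr, pseudocount=1.0):
    """Summarize one gene as (coefficient of variation, max |log2 fold change|).

    `expr` holds normalized expression values, one per condition, and is
    assumed to have a positive mean. The pseudocount avoids division by zero.
    """
    cv = pstdev(expr) / mean(expr)
    max_lfc = math.log2((max(expr) + pseudocount) / (min(expr) + pseudocount))
    return (cv, max_lfc)


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's K-means (k >= 2) with a deterministic initialization."""
    # Initialize centroids with k points evenly spaced in sorted order.
    ordered = sorted(points)
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments did not change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels, centroids


def candidate_hkgs(expr_by_gene, k=2):
    """Return the genes in the cluster with the lowest variation and fold change."""
    genes = list(expr_by_gene)
    points = [stability_features(expr_by_gene[g]) for g in genes]
    labels, centroids = kmeans(points, k)
    # The most stable cluster is the one whose centroid is nearest the origin.
    stable = min(range(k), key=lambda c: math.hypot(*centroids[c]))
    return sorted(g for g, lab in zip(genes, labels) if lab == stable)


# Hypothetical toy data: four genes measured under four conditions.
toy = {
    "geneA": [100.0, 102.0, 98.0, 101.0],
    "geneB": [50.0, 51.0, 49.0, 50.0],
    "geneC": [10.0, 80.0, 5.0, 40.0],
    "geneD": [200.0, 20.0, 150.0, 30.0],
}
print(candidate_hkgs(toy))
```

On this toy input the two genes with nearly flat profiles (geneA and geneB) fall into the stable cluster. In practice the number of clusters would be chosen with internal validation indexes rather than fixed at two, and the features would usually be scaled before clustering.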

Author information

Authors and Affiliations

Institute of Biological Sciences, Laboratory of Biological Engineering, Federal University of Para, Belem, Para, Brazil
Edian F. Franco, Luis Guimarães, Artur Silva & Rommel T. J. Ramos
Department of Computer Science, Computer Science Postgraduate Program (PPGCC), Federal University of Para, Belem, Para, Brazil
Edian F. Franco, Dener Maués, Jefferson Morais & Rommel T. J. Ramos
Vale Technology Institute, Belem, Para, Brazil
Ronnie Alves
Institute of Biological Sciences, Federal University of Minas Gerais-UFMG, Belo Horizonte, Minas Gerais, Brazil
Vasco Azevedo
Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
Preetam Ghosh
Basic and Environment Science Department, Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo, Dominican Republic
Edian F. Franco

Corresponding author

Correspondence to Rommel T. J. Ramos.

Editor information

Editors and Affiliations

Fluminense Federal University, Niterói, Brazil
Luis Kowada
Fluminense Federal University, Niterói, Brazil
Daniel de Oliveira

Cite this paper

Franco, E.F. et al. (2020). A Clustering Approach to Identify Candidates to Housekeeping Genes Based on RNA-seq Data. In: Kowada, L., de Oliveira, D. (eds) Advances in Bioinformatics and Computational Biology. BSB 2019. Lecture Notes in Computer Science, vol 11347. Springer, Cham. https://doi.org/10.1007/978-3-030-46417-2_8

DOI: https://doi.org/10.1007/978-3-030-46417-2_8
Published: 29 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46416-5
Online ISBN: 978-3-030-46417-2
eBook Packages: Computer Science, Computer Science (R0)
