Integrating Biological Context into the Analysis of Gene Expression Data

  • Cindy Perscheid
  • Matthias Uflacker
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 801)


High-throughput RNA sequencing (RNA-Seq) produces large gene expression datasets whose analysis contributes to a better understanding of diseases such as cancer. RNA-Seq data is challenging to analyze because of its high dimensionality, its noise, and the complexity of the underlying biological processes. Researchers typically apply traditional machine learning approaches, e.g. hierarchical clustering, to this data. Until the results are validated, the analysis relies solely on the provided data and disregards the biological context.

However, gene expression data follows particular patterns that reflect the underlying biological processes. In our research, we aim to integrate the available biological knowledge earlier in the analysis process. To this end, we want to adapt state-of-the-art data mining algorithms so that they consider the biological context in their computations and deliver meaningful results for researchers.


Gene expression, Machine learning, Feature selection, Association rule mining, Biclustering, Knowledge bases



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
