Density-Based Clustering of Functionally Similar Genes Using Biological Knowledge
Clustering is used to identify natural groups present in data. It has been widely applied to gene expression data to discover clusters of genes that might be involved in the same biological processes. This information is important for analyzing data from fatal diseases such as cancer and for identifying potential diagnostic and prognostic markers. Existing clustering methods used for this purpose are computationally efficient but do not always produce biologically meaningful results. They also suffer from at least one of the following shortcomings: they cannot handle arbitrarily shaped clusters, they require the number of clusters to be specified in advance, or they do not deal well with the noise present in biological data. In this study, a new density-based clustering method specific to gene expression data is introduced that overcomes these shortcomings and produces biologically enriched clusters of functionally similar genes by incorporating biological information from the Gene Ontology (GO). The proposed method integrates GO semantic similarity information with correlation information between genes to obtain clusters. The clusters are further validated for their biological relevance using Disease Ontology, KEGG pathway enrichment, and protein-protein interaction network analysis.
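To make the idea of combining the two information sources concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes that Pearson correlation is rescaled to [0, 1], that it is blended with a precomputed GO semantic similarity matrix through a hypothetical weight `alpha`, and that a standard DBSCAN-style procedure is then run on the resulting distance matrix. The function names, the toy expression profiles, the GO similarity values, and the `eps` and `min_pts` settings are all illustrative assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def combined_distance(expr, go_sim, alpha=0.5):
    """Blend correlation and GO similarity into one distance matrix.

    alpha is a hypothetical weight (assumption, not from the paper).
    """
    n = len(expr)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            corr_sim = (pearson(expr[i], expr[j]) + 1) / 2  # map [-1, 1] to [0, 1]
            sim = alpha * corr_sim + (1 - alpha) * go_sim[i][j]
            dist[i][j] = dist[j][i] = 1.0 - sim
    return dist

def dbscan(dist, eps, min_pts):
    """Standard DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]  # includes p itself
        if len(neighbors) < min_pts:
            labels[p] = -1
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_neighbors)
    return labels

# Toy data (made up): genes 0-2 rise together, genes 3-5 fall together, gene 6 is erratic.
expr = [
    [1, 2, 3, 4], [2, 4, 6, 8.5], [1.1, 2.1, 2.9, 4.2],
    [4, 3, 2, 1], [8, 6.1, 4, 2], [3.9, 3.1, 2, 1.2],
    [1, 5, 1, 5],
]
groups = [0, 0, 0, 1, 1, 1, 2]
# Made-up GO similarity: high within a functional group, low otherwise.
go_sim = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
           for j in range(7)] for i in range(7)]

labels = dbscan(combined_distance(expr, go_sim), eps=0.3, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy run the erratic gene is labeled as noise rather than forced into a cluster, and the number of clusters is never specified in advance, which are the two properties of density-based clustering that the abstract highlights.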
Keywords: Clustering, Gene expression, Cancer biomarkers
This work was partially supported by the Department of Science and Technology, Government of India, New Delhi (grant no. ECR/2016/001917).