Multi-view Text Mining for Disease Gene Prioritization and Clustering

Yu, Shi; Tranchevent, Léon-Charles; De Moor, Bart; Moreau, Yves

doi:10.1007/978-3-642-19406-1_5

Part of the book series: Studies in Computational Intelligence (SCI, volume 345)

Abstract

Text mining helps biologists collect disease-gene associations automatically from large volumes of biological literature. Over the past ten years, there has been a surge of interest in the automatic exploration of the biomedical literature, ranging from the modest approach of annotating and extracting keywords from text to more ambitious attempts such as natural language processing and text-mining-based network construction and inference. In particular, these efforts help biologists identify the most likely disease candidates for further experimental validation. The most important resource for text mining applications is currently the MEDLINE database, developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). Because MEDLINE covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [45]. A successful text mining approach therefore relies heavily on an appropriate model. In creating a text mining model, the selection of the controlled vocabulary (CV) and the representation scheme of terms play a central role, and the efficiency of biomedical knowledge discovery varies greatly between different text mining models. To address these challenges, we propose a multi-view text mining approach that retrieves information from different biomedical domain levels and combines it to identify disease-relevant genes through prioritization. A view represents a text mining result retrieved with a specific CV, so multi-view text mining means applying multiple controlled vocabularies to retrieve gene-centric perspectives from free-text publications. Since all the information is retrieved from the same MEDLINE database and varies only by CV, the term view also indicates that the data consist of multiple domain-based perspectives of the same corpus. We expect that the correlated and complementary information contained in the multi-view textual data can facilitate the understanding of the roles of genes in genetic diseases.
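
As a rough illustration of what a "view" means here, the following Python sketch builds one term-frequency profile per gene for each of two controlled vocabularies over the same toy corpus, then ranks candidate genes by their average similarity to a known disease gene. Everything in it is an assumption made for illustration: the gene names, the texts, the two vocabularies, and the plain averaging of cosine similarities are hypothetical and are not the fusion method of the chapter, which the abstract does not specify.

```python
import math
from collections import Counter

# Toy corpus: gene -> associated texts (all invented for illustration).
DOCS = {
    "GENE_A": ["muscle atrophy and muscle weakness",
               "atrophy in skeletal muscle tissue"],
    "GENE_B": ["muscle weakness with apoptosis"],
    "GENE_C": ["retina development and apoptosis"],
}

# Two hypothetical controlled vocabularies; each one defines a view.
VOCABS = {
    "phenotype": ["atrophy", "weakness"],
    "anatomy_process": ["muscle", "retina", "apoptosis"],
}


def view_profile(texts, vocab):
    """Term-frequency vector of one gene, restricted to one vocabulary."""
    counts = Counter(word for text in texts for word in text.split())
    return [counts[term] for term in vocab]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_views(docs, vocabs):
    """Same corpus, one gene-by-term profile table per vocabulary."""
    return {
        name: {gene: view_profile(texts, vocab) for gene, texts in docs.items()}
        for name, vocab in vocabs.items()
    }


def prioritize(views, seed, candidates):
    """Rank candidates by mean cosine similarity to the seed over all views.

    Averaging is a placeholder combination rule, chosen only for brevity.
    """
    scores = {}
    for gene in candidates:
        sims = [cosine(p[gene], p[seed]) for p in views.values()]
        scores[gene] = sum(sims) / len(sims)
    ranking = sorted(candidates, key=lambda g: scores[g], reverse=True)
    return ranking, scores


if __name__ == "__main__":
    views = build_views(DOCS, VOCABS)
    ranking, scores = prioritize(views, "GENE_A", ["GENE_B", "GENE_C"])
    for gene in ranking:
        print(gene, round(scores[gene], 3))
```

The point of the sketch is only that each vocabulary yields a different profile of the same genes from the same texts, and that a combination step turns those per-view similarities into a single ranking.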

References

Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J., Pickard, B.S.: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 22, 773–774 (2006)
Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J., Pickard, B.S.: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 6, 55 (2005)
Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.-C., De Moor, B., Marynen, P., Hassan, B., Carmeliet, P., Moreau, Y.: Gene prioritization through genomic data fusion. Nature Biotechnology 24, 537–544 (2006)
Asur, S., Parthasarathy, S., Ucar, D.: An ensemble framework for clustering protein-protein interaction network. Bioinformatics 23, i29–i40 (2007)
Ayad, H.G., Kamel, M.S.: Cumulative voting consensus method for partitions with a variable number of clusters. IEEE Trans. PAMI 30, 160–173 (2008)
Aymé, S.: Bridging the gap between molecular genetics and metabolic medicine: access to genetic information. European Journal of Pediatrics 159, S183–S185 (2000)
Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of 21st International Conference of Machine Learning. ACM Press, New York (2004)
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: What is the Nearest Neighbor in High Dimensional Spaces? In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 217–235. Springer, Heidelberg (1999)
Bickel, S., Scheffer, T.: Multi-View Clustering. In: Proc. of IEEE data mining Conference, pp. 19–26. IEEE, Los Alamitos (2004)
Bodenreider, O.: Lexical, Terminological, and Ontological Resources for Biological Text Mining. In: Ananiadou, S., McNaught, J.N. (eds.) Text mining for biology and biomedicine, pp. 43–66. Artech House, Boston (2006)
Chen, J.H., Zhao, Z., Ye, J.P., Liu, H.: Nonlinear adaptive distance metric learning for clustering. In: Proceeding of ACM KDD, pp. 123–132. ACM Press, New York (2007)
Chun, H.W., Yoshimasa, T., Kim, J.D., Rie, S., Naoki, N., Teruyoshi, H.: Extraction of gene-disease relations from MEDLINE using domain dictionaries and machine learning. In: Proceeding of PSB 2006, pp. 4–15 (2007)
Cohen, T.J., Barrientos, T., Hartman, Z.C., Garvey, S.M., Cox, G.A., Yao, T.P.: The deacetylase HDAC4 controls myocyte enhancing factor-2-dependent structural gene expression in response to neural activity. The FASEB Journal 23, 99–106 (2009)
Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000)
De Bie, T., Tranchevent, L.C., Van Oeffelen, L., Moreau, Y.: Kernel-based data fusion for gene prioritization. Bioinformatics 23, i125–i132 (2007)
De Mars, G., Windelinckx, A., Beunen, G., Delecluse, G., Lefevre, J., Thomis, M.A.: Polymorphisms in the CNTF and CNTF receptor genes are associated with muscle strength in men and women. Journal of Applied Physiology 102, 1824–1831 (2007)
van Driel, M.A., Cuelenaere, K., Kemmeren, P.P.C.W., Leunissen, J.A.M., Brunner, H.G., Vriend, G.: GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Research 33, 758–761 (2005)
Emmert, D.B., Stoehr, P.J., Stoesser, G., Cameron, G.N.: The European Bioinformatics Institute (EBI) databases. Nucleic Acids Research 22, 3445–3449 (1994)
Escarceller, M., Pluvinet, R., Sumoy, L., Estivill, X.: Identification and expression analysis of C3orf1, a novel human gene homologous to the Drosophila RP140-upstream gene. DNA Seq. 11, 335–338 (2000)
Franke, L., van Bakel, H., Fokkens, L., de Jong, E.D., Egmont-Petersen, M., Wijmenga, C.: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006)
Fred, A.L.N., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE Trans. PAMI 27, 835–850 (2005)
Gaulton, K.J., Mohlke, K.L., Vision, T.J.: A computational system to select candidate genes for complex human traits. Bioinformatics 23, 1132–1140 (2007)
Glenisson, P., Coessens, B., Van Vooren, S., Mathys, J., Moreau, Y., De Moor, B.: TXTGate: profiling gene groups with text-based information. Genome Biology 5, R43 (2004)
Glenisson, W., Castronovo, V., Waltregny, D.: Histone deacetylase 4 is required for TGFbeta1-induced myofibroblastic differentiation. Biochim. Biophys. Acta 1773, 1572–1582 (2007)
Hsu, C.M., Chen, M.S.: On the Design and Applicability of Distance Functions in High-Dimensional Data Space. IEEE Trans. on Knowledge and Data Engineering 21, 523–536 (2009)
Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R., Rebholz-Schuhmann, D.: Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics 9, S3 (2008)
Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A.: The KEGG databases at GenomeNet. Nucleic Acids Research 30, 42–46 (2002)
Kelso, J., Visagie, J., Theiler, G., Christoffels, A., Bardien-Kruger, S., Smedley, D., Otgaar, D., Greyling, G., Jongeneel, V., McCarthy, M., Hide, T., Hide, W.: eVOC: A Controlled Vocabulary for Gene Expression Data. Genome Research 13, 1222–1230 (2003)
Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research 5, 27–72 (2004)
Lange, T., Buhmann, J.M.: Fusion of Similarity Data in Clustering. Advances in Neural Information Processing Systems 18, 723–730 (2006)
Little, G.H., Bai, Y., Williams, T., Poizat, C.: Nuclear calcium/calmodulin-dependent protein kinase IIdelta preferentially transmits signals to histone deacetylase 4 in cardiac cells. The Journal of Biological Chemistry 282, 7219–7231 (2007)
Liu, X., Yu, S., Moreau, Y., De Moor, B., Glänzel, W., Janssens, F.: Hybrid Clustering of Text Mining and Bibliometrics Applied to Journal Sets. In: Proc. of the SIAM Data Mining Conference 2009. SIAM Press, Philadelphia (2009)
Lopez-Bigas, N., Ouzounis, C.A.: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Research 32, 3108–3114 (2004)
Mao, X.: Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 21, 3787–3793 (2005)
McKusick, V.A.: Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders, 12th edn. Johns Hopkins University Press, Baltimore (1998)
Melton, G.B.: Inter-patient distance metrics using SNOMED CT defining relationships. Journal of Biomedical Informatics 39, 697–705 (2006)
Monti, S.: Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 52, 91–118 (2003)
Mottaz, A.: Mapping proteins to disease terminologies: from UniProt to MeSH. BMC Bioinformatics 9, S5 (2008)
Neveol, A.: Multiple approaches to fine-grained indexing of the biomedical literature. In: Proceeding of PSB 2007, pp. 292–303 (2007)
Perez-Iratxeta, C., Wjst, M., Bork, P., Andrade, M.A.: G2D: a tool for mining genes associated with disease. BMC Genetics 6, 45 (2005)
Plun-Favreau, H., Elson, G., Chabbert, M., Froger, J., de Lapeyriére, O., Leliévre, E., Guillet, C., Hermann, J., Gauchat, J.F., Gascan, H., Chevalier, S.: The ciliary neurotrophic factor receptor α component induces the secretion of and is required for functional responses to cardiotrophin-like cytokine. EMBO Journal 20, 1692–1703 (2001)
Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
Risch, N.J.: Searching for genetic determinants in the new millennium. Nature 405, 847–856 (2000)
Roth, S.M., Metter, E.J., Lee, M.R., Hurley, B.F., Ferrell, R.E.: C174T polymorphism in the CNTF receptor gene is associated with fat-free mass in men and women. Journal of Applied Physiology 95, 1425–1430 (2003)
Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology 10, 821–855 (2003)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. Journal of Machine Learning Research 7, 1531–1565 (2006)
Sturm, J.F.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software 11/12, 625–653 (1999)
Smith, C.L.: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biology 6, R7 (2004)
Strehl, A., Ghosh, J.: Clustering Ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)
Stuart, J.M.: A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003)
Tiffin, N.: Prioritization of candidate disease genes for metabolic syndrome by computational analysis of its defining phenotypes. Physiol. Genomics 35, 55–64 (2008)
Tiffin, N., Kelso, J.F., Powell, A.R., Pan, H., Bajic, V.B., Hide, W.: Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res. 33, 1544–1552 (2005)
Topchy, A.: Clustering Ensembles: models of consensus and weak partitions. IEEE Trans. PAMI 27, 1866–1881 (2005)
Tranchevent, L., Barriot, R., Yu, S., Van Vooren, S., Van Loo, P., Coessens, B., De Moor, B., Aerts, S., Moreau, Y.: ENDEAVOUR update: a web resource for gene prioritization in multiple species. Nucleic Acids Research 36, W377–W384 (2008)
Turner, F.S., Clutterbuck, D.R., Semple, C.A.M.: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biology 4, R75 (2003)
Van Vooren, S., Thienpont, B., Menten, B., Speleman, F., De Moor, B., Vermeesch, J.R., Moreau, Y.: Mapping Biomedical Concepts onto the Human Genome by Mining Literature on Chromosomal Aberrations. Nucleic Acids Research 35, 2533–2543 (2007)
Vapnik, V.: Statistical Learning Theory. Wiley Interscience, New York (1998)
Winter, R.M., Baraitser, M., Douglas, J.M.: A computerised data base for the diagnosis of rare dysmorphic syndromes. Journal of Medical Genetics 21, 121–123 (1984)
Wolf, D.M., Bodin, L.F., Bischofs, I., Price, G., Keasling, J., Arkin, A.P.: Memory in Microbes: Quantifying History-Dependent Behavior in a Bacterium. PLoS ONE 3, e1700 (2008)
Yamakawa, H.: Multi-aspect gene relation analysis. In: Proceeding of PSB 2005, pp. 233–244 (2005)
Ye, J.P., Ji, S.W., Chen, J.H.: Multi-class Discriminant Kernel Learning via Convex Programming. Journal of Machine Learning Research 9, 719–758 (2008)
Yu, S., Tranchevent, L.-C., Van Vooren, S., De Moor, B., Moreau, Y.: Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining. Bioinformatics 24, i119–i125 (2008)
Yu, S., Tranchevent, L.-C., Liu, X., Glänzel, W., Suykens, J.A.K., De Moor, B., Moreau, Y.: Optimized data fusion for kernel K-means clustering. Internal Report 08-200, ESAT-SISTA, K.U.Leuven, Lirias number: 242275 (2008) (submitted for publication)
Yu, Z.W., Wong, H.-S., Wang, H.Q.: Graph-based consensus clustering for class discovery from gene expression data. Bioinformatics 23, 2888–2896 (2007)
Zha, H., Ding, C., Gu, M., He, X., Simon, H.: Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 13, 1057–1064 (2001)

About this chapter

Cite this chapter

Yu, S., Tranchevent, LC., De Moor, B., Moreau, Y. (2011). Multi-view Text Mining for Disease Gene Prioritization and Clustering. In: Kernel-based Data Fusion for Machine Learning. Studies in Computational Intelligence, vol 345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19406-1_5

DOI: https://doi.org/10.1007/978-3-642-19406-1_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19405-4
Online ISBN: 978-3-642-19406-1
eBook Packages: Engineering, Engineering (R0)