Abstract
Text mining helps biologists collect disease-gene associations automatically from large volumes of biological literature. Over the past ten years, there has been a surge of interest in automatic exploration of the biomedical literature, ranging from modest approaches such as annotating and extracting keywords from text to more ambitious attempts such as natural language processing and text-mining-based network construction and inference. In particular, these efforts help biologists identify the most likely disease candidates for further experimental validation. The most important resource for text mining applications is currently the MEDLINE database developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). Because MEDLINE covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [45]. A successful text mining approach therefore depends heavily on an appropriate model. In building such a model, the choice of controlled vocabulary (CV) and the representation scheme of terms play a central role, and the efficiency of biomedical knowledge discovery varies greatly between text mining models. To address these challenges, we propose a multi-view text mining approach that retrieves information from different biomedical domain levels and combines it to identify disease-relevant genes through prioritization. A view is a text mining result retrieved with a specific CV, so multi-view text mining means applying multiple controlled vocabularies to retrieve gene-centric perspectives from free-text publications. Since all the information is retrieved from the same MEDLINE database and varies only by CV, the term view also indicates that the data consist of multiple domain-based perspectives of the same corpus.
We expect that the correlated and complementary information contained in the multi-view textual data can improve understanding of the roles of genes in genetic diseases.
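The idea of a view can be illustrated with a minimal sketch, under loud assumptions: the corpus, gene names, and the two vocabularies below are hypothetical toy data (not MEDLINE or any real CV), and the fusion step (mean cosine similarity to a single training gene) is a stand-in chosen for brevity, not the prioritization method of the chapter. The sketch only shows how one corpus yields several gene-by-term profiles, one per controlled vocabulary.

```python
import math
from collections import Counter

# Hypothetical toy corpus: gene -> abstracts mentioning it (not real MEDLINE data).
CORPUS = {
    "GENE_A": ["muscle atrophy linked to histone deacetylase activity in muscle"],
    "GENE_B": ["kinase signaling pathway in cardiac cells"],
    "GENE_C": ["muscle strength and atrophy phenotype in mice"],
}

# Hypothetical controlled vocabularies; each one defines a view of the same corpus.
VOCABULARIES = {
    "phenotype_cv": ["atrophy", "strength", "phenotype"],
    "anatomy_cv": ["muscle", "cardiac", "cells"],
}


def gene_profile(abstracts, vocabulary):
    """Count how often each vocabulary term occurs in a gene's abstracts."""
    counts = Counter(word for text in abstracts for word in text.lower().split())
    return [counts[term] for term in vocabulary]


def build_views(corpus, vocabularies):
    """Return {view name: {gene: term-count vector}} built from one corpus."""
    return {
        name: {gene: gene_profile(texts, vocab) for gene, texts in corpus.items()}
        for name, vocab in vocabularies.items()
    }


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prioritize(views, training_gene, candidates):
    """Rank candidates by mean similarity to the training gene across views.

    Averaging is an illustrative assumption, not the chapter's fusion scheme.
    """
    scores = {}
    for gene in candidates:
        sims = [
            cosine(profiles[gene], profiles[training_gene])
            for profiles in views.values()
        ]
        scores[gene] = sum(sims) / len(sims)
    return sorted(candidates, key=lambda g: scores[g], reverse=True)


views = build_views(CORPUS, VOCABULARIES)
ranking = prioritize(views, "GENE_A", ["GENE_B", "GENE_C"])
print(ranking)
```

In this toy setting each vocabulary sees a different aspect of the same abstracts, so a candidate can score well in one view and poorly in another; combining the views is what lets complementary evidence contribute to the final ranking.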
References
Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J., Pickard, B.S.: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 22, 773–774 (2006)
Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J., Pickard, B.S.: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 6, 55 (2005)
Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.-C., De Moor, B., Marynen, P., Hassan, B., Carmeliet, P., Moreau, Y.: Gene prioritization through genomic data fusion. Nature Biotechnology 24, 537–544 (2006)
Asur, S., Parthasarathy, S., Ucar, D.: An ensemble framework for clustering protein-protein interaction network. Bioinformatics 23, i29–i40 (2007)
Ayad, H.G., Kamel, M.S.: Cumulative voting consensus method for partitions with a variable number of clusters. IEEE Trans. PAMI 30, 160–173 (2008)
Aymé, S.: Bridging the gap between molecular genetics and metabolic medicine: access to genetic information. European Journal of Pediatrics 159, S183–S185 (2000)
Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of 21st International Conference of Machine Learning. ACM Press, New York (2004)
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: What is the Nearest Neighbor in High Dimensional Spaces? In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 217–235. Springer, Heidelberg (1999)
Bickel, S., Scheffer, T.: Multi-View Clustering. In: Proc. of IEEE data mining Conference, pp. 19–26. IEEE, Los Alamitos (2004)
Bodenreider, O.: Lexical, Terminological, and Ontological Resources for Biological Text Mining. In: Ananiadou, S., McNaught, J.N. (eds.) Text mining for biology and biomedicine, pp. 43–66. Artech House, Boston (2006)
Chen, J.H., Zhao, Z., Ye, J.P., Liu, H.: Nonlinear adaptive distance metric learning for clustering. In: Proceeding of ACM KDD, pp. 123–132. ACM Press, New York (2007)
Chun, H.W., Yoshimasa, T., Kim, J.D., Rie, S., Naoki, N., Teruyoshi, H.: Extraction of gene-disease relations from MEDLINE using domain dictionaries and machine learning. In: Proceeding of PSB 2006, pp. 4–15 (2007)
Cohen, T.J., Barrientos, T., Hartman, Z.C., Garvey, S.M., Cox, G.A., Yao, T.P.: The deacetylase HDAC4 controls myocyte enhancing factor-2-dependent structural gene expression in response to neural activity. The FASEB Journal 23, 99–106 (2009)
Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000)
De Bie, T., Tranchevent, L.C., Van Oeffelen, L., Moreau, Y.: Kernel-based data fusion for gene prioritization. Bioinformatics 23, i125–i132 (2007)
De Mars, G., Windelinckx, A., Beunen, G., Delecluse, G., Lefevre, J., Thomis, M.A.: Polymorphisms in the CNTF and CNTF receptor genes are associated with muscle strength in men and women. Journal of Applied Physiology 102, 1824–1831 (2007)
van Driel, M.A., Cuelenaere, K., Kemmeren, P.P.C.W., Leunissen, J.A.M., Brunner, H.G., Vriend, G.: GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Research 33, 758–761 (2005)
Emmert, D.B., Stoehr, P.J., Stoesser, G., Cameron, G.N.: The European Bioinformatics Institute (EBI) databases. Nucleic Acids Research 26, 3445–3449 (1994)
Escarceller, M., Pluvinet, R., Sumoy, L., Estivill, X.: Identification and expression analysis of C3orf1, a novel human gene homologous to the Drosophila RP140-upstream gene. DNA Seq. 11, 335–338 (2000)
Franke, L., van Bakel, H., Fokkens, L., de Jong, E.D., Egmont-Petersen, M., Wijmenga, C.: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006)
Fred, A.L.N., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE Trans. PAMI 27, 835–850 (2005)
Gaulton, K.J., Mohlke, K.L., Vision, T.J.: A computational system to select candidate genes for complex human traits. Bioinformatics 23, 1132–1140 (2007)
Glenisson, P., Coessens, B., Van Vooren, S., Mathys, J., Moreau, Y., De Moor, B.: TXTGate: profiling gene groups with text-based information. Genome Biology 5, R43 (2004)
Glenisson, W., Castronovo, V., Waltregny, D.: Histone deacetylase 4 is required for TGFbeta1-induced myofibroblastic differentiation. Biochim. Biophys. Acta 1773, 1572–1582 (2007)
Hsu, C.M., Chen, M.S.: On the Design and Applicability of Distance Functions in High-Dimensional Data Space. IEEE Trans. on Knowledge and Data Engineering 21, 523–536 (2009)
Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R., Rebholz-Schuhmann, D.: Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics 9, S3 (2008)
Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A.: The KEGG databases at GenomeNet. Nucleic Acids Research 30, 42–46 (2002)
Kelso, J., Visagie, J., Theiler, G., Christoffels, A., Bardien-Kruger, S., Smedley, D., Otgaar, D., Greyling, G., Jongeneel, V., McCarthy, M., Hide, T., Hide, W.: eVOC: A Controlled Vocabulary for Gene Expression Data. Genome Research 13, 1222–1230 (2003)
Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research 5, 27–72 (2004)
Lange, T., Buhmann, J.M.: Fusion of Similarity Data in Clustering. Advances in Neural Information Processing Systems 18, 723–730 (2006)
Little, G.H., Bai, Y., Williams, T., Poizat, C.: Nuclear calcium/calmodulin-dependent protein kinase IIdelta preferentially transmits signals to histone deacetylase 4 in cardiac cells. The Journal of Biological Chemistry 282, 7219–7231 (2007)
Liu, X., Yu, S., Moreau, Y., De Moor, B., Glänzel, W., Janssens, F.: Hybrid Clustering of Text Mining and Bibliometrics Applied to Journal Sets. In: Proc. of the SIAM Data Mining Conference 2009. SIAM Press, Philadelphia (2009)
Lopez-Bigas, N., Ouzounis, C.A.: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Research 32, 3108–3114 (2004)
Mao, X.: Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 21, 3787–3793 (2005)
McKusick, V.A.: Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders, 12th edn. Johns Hopkins University Press, Baltimore (1998)
Melton, G.B.: Inter-patient distance metrics using SNOMED CT defining relationships. Journal of Biomedical Informatics 39, 697–705 (2006)
Monti, S.: Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 52, 91–118 (2003)
Mottaz, A.: Mapping proteins to disease terminologies: from UniProt to MeSH. BMC Bioinformatics 9, S5 (2008)
Neveol, A.: Multiple approaches to fine-grained indexing of the biomedical literature. In: Proceeding of PSB 2007, pp. 292–303 (2007)
Perez-Iratxeta, C., Wjst, M., Bork, P., Andrade, M.A.: G2D: a tool for mining genes associated with disease. BMC Genetics 6, 45 (2005)
Plun-Favreau, H., Elson, G., Chabbert, M., Froger, J., de Lapeyrière, O., Lelièvre, E., Guillet, C., Hermann, J., Gauchat, J.F., Gascan, H., Chevalier, S.: The ciliary neurotrophic factor receptor α component induces the secretion of and is required for functional responses to cardiotrophin-like cytokine. EMBO Journal 20, 1692–1703 (2001)
Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
Risch, N.J.: Searching for genetic determinants in the new millennium. Nature 405, 847–856 (2000)
Roth, S.M., Metter, E.J., Lee, M.R., Hurley, B.F., Ferrell, R.E.: C174T polymorphism in the CNTF receptor gene is associated with fat-free mass in men and women. Journal of Applied Physiology 95, 1425–1430 (2003)
Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology 10, 821–855 (2003)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. Journal of Machine Learning Research 7, 1531–1565 (2006)
Sturm, J.F.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software 11/12, 625–653 (1999)
Smith, C.L.: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biology 6, R7 (2004)
Strehl, A., Ghosh, J.: Clustering Ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)
Stuart, J.M.: A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003)
Tiffin, N.: Prioritization of candidate disease genes for metabolic syndrome by computational analysis of its defining phenotypes. Physiol. Genomics 35, 55–64 (2008)
Tiffin, N., Kelso, J.F., Powell, A.R., Pan, H., Bajic, V.B., Hide, W.: Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res. 33, 1544–1552 (2005)
Topchy, A.: Clustering Ensembles: models of consensus and weak partitions. IEEE Trans. PAMI 27, 1866–1881 (2005)
Tranchevent, L., Barriot, R., Yu, S., Van Vooren, S., Van Loo, P., Coessens, B., De Moor, B., Aerts, S., Moreau, Y.: ENDEAVOUR update: a web resource for gene prioritization in multiple species. Nucleic Acids Research 36, W377–W384 (2008)
Turner, F.S., Clutterbuck, D.R., Semple, C.A.M.: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biology 4, R75 (2003)
Van Vooren, S., Thienpont, B., Menten, B., Speleman, F., De Moor, B., Vermeesch, J.R., Moreau, Y.: Mapping Biomedical Concepts onto the Human Genome by Mining Literature on Chromosomal Aberrations. Nucleic Acids Research 35, 2533–2543 (2007)
Vapnik, V.: Statistical Learning Theory. Wiley Interscience, New York (1998)
Winter, R.M., Baraitser, M., Douglas, J.M.: A computerised data base for the diagnosis of rare dysmorphic syndromes. Journal of Medical Genetics 21, 121–123 (1984)
Wolf, D.M., Bodin, L.F., Bischofs, I., Price, G., Keasling, J., Arkin, A.P.: Memory in Microbes: Quantifying History-Dependent Behavior in a Bacterium. PLoS ONE 3, e1700 (2008)
Yamakawa, H.: Multi-aspect gene relation analysis. In: Proceeding of PSB 2005, pp. 233–244 (2005)
Ye, J.P., Ji, S.W., Chen, J.H.: Multi-class Discriminant Kernel Learning via Convex Programming. Journal of Machine Learning Research 9, 719–758 (2008)
Yu, S., Tranchevent, L.-C., Van Vooren, S., De Moor, B., Moreau, Y.: Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining. Bioinformatics 24, i119–i125 (2008)
Yu, S., Tranchevent, L.-C., Liu, X., Glänzel, W., Suykens, J.A.K., De Moor, B., Moreau, Y.: Optimized data fusion for kernel K-means clustering. Internal Report 08-200, ESAT-SISTA, K.U.Leuven, Lirias number: 242275 (2008) (submitted for publication)
Yu, Z.W., Wong, H.-S., Wang, H.Q.: Graph-based consensus clustering for class discovery from gene expression data. Bioinformatics 23, 2888–2896 (2007)
Zha, H., Ding, C., Gu, M., He, X., Simon, H.: Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 13, 1057–1064 (2001)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Yu, S., Tranchevent, LC., De Moor, B., Moreau, Y. (2011). Multi-view Text Mining for Disease Gene Prioritization and Clustering. In: Kernel-based Data Fusion for Machine Learning. Studies in Computational Intelligence, vol 345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19406-1_5
DOI: https://doi.org/10.1007/978-3-642-19406-1_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19405-4
Online ISBN: 978-3-642-19406-1
eBook Packages: Engineering, Engineering (R0)