Multi-view Text Mining for Disease Gene Prioritization and Clustering

Kernel-based Data Fusion for Machine Learning

Part of the book series: Studies in Computational Intelligence (SCI, volume 345)

Abstract

Text mining helps biologists collect disease-gene associations automatically from large volumes of biological literature. Over the past ten years, there has been a surge of interest in the automatic exploration of the biomedical literature, ranging from modest approaches such as annotating and extracting keywords from text to more ambitious attempts such as Natural Language Processing and text-mining-based network construction and inference. In particular, these efforts help biologists identify the most likely disease candidates for further experimental validation. The most important resource for text mining applications is now the MEDLINE database, developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). Because MEDLINE covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [45]. A successful text mining approach therefore depends heavily on an appropriate model. In creating such a model, the selection of the Controlled Vocabulary (CV) and the representation scheme of terms play a central role, and the efficiency of biomedical knowledge discovery varies greatly between text mining models. To address these challenges, we propose a multi-view text mining approach that retrieves information from different biomedical domain levels and combines it to identify disease-relevant genes through prioritization. A view represents a text mining result retrieved with a specific CV, so multi-view text mining is characterized by applying multiple controlled vocabularies to retrieve gene-centric perspectives from free-text publications. Since all the information is retrieved from the same MEDLINE database and varies only by CV, the term view also indicates that the data consist of multiple domain-based perspectives of the same corpus. We expect that the correlated and complementary information contained in the multi-view textual data can facilitate understanding of the roles of genes in genetic diseases.
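
The notion of a view can be made concrete with a small sketch. The Python example below is only an illustration under stated assumptions, not the method developed in the chapter: the gene names, the toy texts, the two vocabularies (`phenotype_cv`, `process_cv`), and the scoring rule (cosine similarity to a summed training profile, averaged over views) are all hypothetical stand-ins. It shows how one corpus indexed with different controlled vocabularies yields several gene-by-term profiles that can then be combined for prioritization.

```python
import math
from collections import Counter

# Toy corpus: gene -> concatenated text of publications linked to it (hypothetical).
CORPUS = {
    "GENE_A": "muscle atrophy and muscle weakness with abnormal apoptosis",
    "GENE_B": "apoptosis signalling in liver tissue",
    "GENE_C": "retina development and eye morphology",
}

# Two hypothetical controlled vocabularies; each one defines a view.
VOCABULARIES = {
    "phenotype_cv": ["atrophy", "weakness", "morphology"],
    "process_cv": ["apoptosis", "signalling", "development"],
}


def term_profile(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def build_views(corpus, vocabularies):
    """Index the same corpus once per vocabulary: view -> gene -> term vector."""
    return {
        name: {gene: term_profile(text, vocab) for gene, text in corpus.items()}
        for name, vocab in vocabularies.items()
    }


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prioritize(views, training_genes, candidates):
    """Rank candidates by their mean similarity to the training genes across views."""
    scores = {}
    for gene in candidates:
        per_view = []
        for profiles in views.values():
            dim = len(profiles[gene])
            # Disease model for this view: summed profile of the known disease genes.
            model = [sum(profiles[g][i] for g in training_genes) for i in range(dim)]
            per_view.append(cosine(profiles[gene], model))
        scores[gene] = sum(per_view) / len(per_view)
    ranking = sorted(candidates, key=lambda g: scores[g], reverse=True)
    return ranking, scores


if __name__ == "__main__":
    views = build_views(CORPUS, VOCABULARIES)
    ranking, scores = prioritize(views, ["GENE_A"], ["GENE_B", "GENE_C"])
    for gene in ranking:
        print(gene, round(scores[gene], 3))
```

Averaging the per-view scores is the simplest possible combination rule and is used here only to keep the sketch short; weighting or otherwise learning the combination of views is a separate design choice.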

References

  1. Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J., Pickard, B.S.: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 22, 773–774 (2006)

  2. Adie, E.A., Adams, R.R., Evans, K.L., Porteous, D.J., Pickard, B.S.: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 6, 55 (2005)

  3. Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.-C., De Moor, B., Marynen, P., Hassan, B., Carmeliet, P., Moreau, Y.: Gene prioritization through genomic data fusion. Nature Biotechnology 24, 537–544 (2006)

  4. Asur, S., Parthasarathy, S., Ucar, D.: An ensemble framework for clustering protein-protein interaction network. Bioinformatics 23, i29–i40 (2007)

  5. Ayad, H.G., Kamel, M.S.: Cumulative voting consensus method for partitions with a variable number of clusters. IEEE Trans. PAMI 30, 160–173 (2008)

  6. Aymé, S.: Bridging the gap between molecular genetics and metabolic medicine: access to genetic information. European Journal of Pediatrics 159, S183–S185 (2000)

  7. Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of 21st International Conference of Machine Learning. ACM Press, New York (2004)

  8. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When Is "Nearest Neighbor" Meaningful? In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 217–235. Springer, Heidelberg (1999)

  9. Bickel, S., Scheffer, T.: Multi-View Clustering. In: Proc. of IEEE data mining Conference, pp. 19–26. IEEE, Los Alamitos (2004)

  10. Bodenreider, O.: Lexical, Terminological, and Ontological Resources for Biological Text Mining. In: Ananiadou, S., McNaught, J.N. (eds.) Text mining for biology and biomedicine, pp. 43–66. Artech House, Boston (2006)

  11. Chen, J.H., Zhao, Z., Ye, J.P., Liu, H.: Nonlinear adaptive distance metric learning for clustering. In: Proceeding of ACM KDD, pp. 123–132. ACM Press, New York (2007)

  12. Chun, H.W., Yoshimasa, T., Kim, J.D., Rie, S., Naoki, N., Teruyoshi, H.: Extraction of gene-disease relations from MEDLINE using domain dictionaries and machine learning. In: Proceeding of PSB 2006, pp. 4–15 (2007)

  13. Cohen, T.J., Barrientos, T., Hartman, Z.C., Garvey, S.M., Cox, G.A., Yao, T.P.: The deacetylase HDAC4 controls myocyte enhancing factor-2-dependent structural gene expression in response to neural activity. The FASEB Journal 23, 99–106 (2009)

  14. Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000)

  15. De Bie, T., Tranchevent, L.C., Van Oeffelen, L., Moreau, Y.: Kernel-based data fusion for gene prioritization. Bioinformatics 23, i125–i132 (2007)

  16. De Mars, G., Windelinckx, A., Beunen, G., Delecluse, G., Lefevre, J., Thomis, M.A.: Polymorphisms in the CNTF and CNTF receptor genes are associated with muscle strength in men and women. Journal of Applied Physiology 102, 1824–1831 (2007)

  17. van Driel, M.A., Cuelenaere, K., Kemmeren, P.P.C.W., Leunissen, J.A.M., Brunner, H.G., Vriend, G.: GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Research 33, 758–761 (2005)

  18. Emmert, D.B., Stoehr, P.J., Stoesser, G., Cameron, G.N.: The European Bioinformatics Institute (EBI) databases. Nucleic Acids Research 26, 3445–3449 (1994)

  19. Escarceller, M., Pluvinet, R., Sumoy, L., Estivill, X.: Identification and expression analysis of C3orf1, a novel human gene homologous to the Drosophila RP140-upstream gene. DNA Seq. 11, 335–338 (2000)

  20. Franke, L., van Bakel, H., Fokkens, L., de Jong, E.D., Egmont-Petersen, M., Wijmenga, C.: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78, 1011–1025 (2006)

  21. Fred, A.L.N., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE Trans. PAMI 27, 835–850 (2005)

  22. Gaulton, K.J., Mohlke, K.L., Vision, T.J.: A computational system to select candidate genes for complex human traits. Bioinformatics 23, 1132–1140 (2007)

  23. Glenisson, P., Coessens, B., Van Vooren, S., Mathys, J., Moreau, Y., De Moor, B.: TXTGate: profiling gene groups with text-based information. Genome Biology 5, R43 (2004)

  24. Glenisson, W., Castronovo, V., Waltregny, D.: Histone deacetylase 4 is required for TGFbeta1-induced myofibroblastic differentiation. Biochim. Biophys. Acta 1773, 1572–1582 (2007)

  25. Hsu, C.M., Chen, M.S.: On the Design and Applicability of Distance Functions in High-Dimensional Data Space. IEEE Trans. on Knowledge and Data Engineering 21, 523–536 (2009)

  26. Jimeno, A., Jimenez-Ruiz, E., Lee, V., Gaudan, S., Berlanga, R., Rebholz-Schuhmann, D.: Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics 9, S3 (2008)

  27. Kanehisa, M., Goto, S., Kawashima, S., Nakaya, A.: The KEGG databases at GenomeNet. Nucleic Acids Research 30, 42–46 (2002)

  28. Kelso, J., Visagie, J., Theiler, G., Christoffels, A., Bardien-Kruger, S., Smedley, D., Otgaar, D., Greyling, G., Jongeneel, V., McCarthy, M., Hide, T., Hide, W.: eVOC: A Controlled Vocabulary for Gene Expression Data. Genome Research 13, 1222–1230 (2003)

  29. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research 5, 27–72 (2004)

  30. Lange, T., Buhmann, J.M.: Fusion of Similarity Data in Clustering. Advances in Neural Information Processing Systems 18, 723–730 (2006)

  31. Little, G.H., Bai, Y., Williams, T., Poizat, C.: Nuclear calcium/calmodulin-dependent protein kinase IIdelta preferentially transmits signals to histone deacetylase 4 in cardiac cells. The Journal of Biological Chemistry 282, 7219–7231 (2007)

  32. Liu, X., Yu, S., Moreau, Y., De Moor, B., Glänzel, W., Janssens, F.: Hybrid Clustering of Text Mining and Bibliometrics Applied to Journal Sets. In: Proc. of the SIAM Data Mining Conference 2009. SIAM Press, Philadelphia (2009)

  33. Lopez-Bigas, N., Ouzounis, C.A.: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Research 32, 3108–3114 (2004)

  34. Mao, X.: Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 21, 3787–3793 (2005)

  35. McKusick, V.A.: Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders, 12th edn. Johns Hopkins University Press, Baltimore (1998)

  36. Melton, G.B.: Inter-patient distance metrics using SNOMED CT defining relationships. Journal of Biomedical Informatics 39, 697–705 (2006)

  37. Monti, S.: Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 52, 91–118 (2003)

  38. Mottaz, A.: Mapping proteins to disease terminologies: from UniProt to MeSH. BMC Bioinformatics 9, S5 (2008)

  39. Neveol, A.: Multiple approaches to fine-grained indexing of the biomedical literature. In: Proceeding of PSB 2007, pp. 292–303 (2007)

  40. Perez-Iratxeta, C., Wjst, M., Bork, P., Andrade, M.A.: G2D: a tool for mining genes associated with disease. BMC Genetics 6, 45 (2005)

  41. Plun-Favreau, H., Elson, G., Chabbert, M., Froger, J., de Lapeyrière, O., Lelièvre, E., Guillet, C., Hermann, J., Gauchat, J.F., Gascan, H., Chevalier, S.: The ciliary neurotrophic factor receptor α component induces the secretion of and is required for functional responses to cardiotrophin-like cytokine. EMBO Journal 20, 1692–1703 (2001)

  42. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)

  43. Risch, N.J.: Searching for genetic determinants in the new millennium. Nature 405, 847–856 (2000)

  44. Roth, S.M., Metter, E.J., Lee, M.R., Hurley, B.F., Ferrell, R.E.: C174T polymorphism in the CNTF receptor gene is associated with fat-free mass in men and women. Journal of Applied Physiology 95, 1425–1430 (2003)

  45. Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology 10, 821–855 (2003)

  46. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)

  47. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. Journal of Machine Learning Research 7, 1531–1565 (2006)

  48. Sturm, J.F.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software 11/12, 625–653 (1999)

  49. Smith, C.L.: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biology 6, R7 (2004)

  50. Strehl, A., Ghosh, J.: Clustering Ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)

  51. Stuart, J.M.: A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255 (2003)

  52. Tiffin, N.: Prioritization of candidate disease genes for metabolic syndrome by computational analysis of its defining phenotypes. Physiol. Genomics 35, 55–64 (2008)

  53. Tiffin, N., Kelso, J.F., Powell, A.R., Pan, H., Bajic, V.B., Hide, W.: Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res. 33, 1544–1552 (2005)

  54. Topchy, A.: Clustering Ensembles: models of consensus and weak partitions. IEEE Trans. PAMI 27, 1866–1881 (2005)

  55. Tranchevent, L., Barriot, R., Yu, S., Van Vooren, S., Van Loo, P., Coessens, B., De Moor, B., Aerts, S., Moreau, Y.: ENDEAVOUR update: a web resource for gene prioritization in multiple species. Nucleic Acids Research 36, W377–W384 (2008)

  56. Turner, F.S., Clutterbuck, D.R., Semple, C.A.M.: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biology 4, R75 (2003)

  57. Van Vooren, S., Thienpont, B., Menten, B., Speleman, F., De Moor, B., Vermeesch, J.R., Moreau, Y.: Mapping Biomedical Concepts onto the Human Genome by Mining Literature on Chromosomal Aberrations. Nucleic Acids Research 35, 2533–2543 (2007)

  58. Vapnik, V.: Statistical Learning Theory. Wiley Interscience, New York (1998)

  59. Winter, R.M., Baraitser, M., Douglas, J.M.: A computerised data base for the diagnosis of rare dysmorphic syndromes. Journal of Medical Genetics 21, 121–123 (1984)

  60. Wolf, D.M., Bodin, L.F., Bischofs, I., Price, G., Keasling, J., Arkin, A.P.: Memory in Microbes: Quantifying History-Dependent Behavior in a Bacterium. PLoS ONE 3, e1700 (2008)

  61. Yamakawa, H.: Multi-aspect gene relation analysis. In: Proceeding of PSB 2005, pp. 233–244 (2005)

  62. Ye, J.P., Ji, S.W., Chen, J.H.: Multi-class Discriminant Kernel Learning via Convex Programming. Journal of Machine Learning Research 9, 719–758 (2008)

  63. Yu, S., Tranchevent, L.-C., Van Vooren, S., De Moor, B., Moreau, Y.: Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining. Bioinformatics 24, i119–i125 (2008)

  64. Yu, S., Tranchevent, L.-C., Liu, X., Glänzel, W., Suykens, J.A.K., De Moor, B., Moreau, Y.: Optimized data fusion for kernel K-means clustering. Internal Report 08-200, ESAT-SISTA, K.U.Leuven, Lirias number: 242275 (2008) (submitted for publication)

  65. Yu, Z.W., Wong, H.-S., Wang, H.Q.: Graph-based consensus clustering for class discovery from gene expression data. Bioinformatics 23, 2888–2896 (2007)

  66. Zha, H., Ding, C., Gu, M., He, X., Simon, H.: Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 13, 1057–1064 (2001)

© 2011 Springer-Verlag Berlin Heidelberg

Cite this chapter

Yu, S., Tranchevent, LC., De Moor, B., Moreau, Y. (2011). Multi-view Text Mining for Disease Gene Prioritization and Clustering. In: Kernel-based Data Fusion for Machine Learning. Studies in Computational Intelligence, vol 345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19406-1_5

  • DOI: https://doi.org/10.1007/978-3-642-19406-1_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19405-4

  • Online ISBN: 978-3-642-19406-1

  • eBook Packages: Engineering (R0)