Mining Biomedical Text towards Building a Quantitative Food-Disease-Gene Network

Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

Advances in biotechnology and the life sciences are producing an ever-increasing volume of published research data, predominantly as unstructured text. Text mining techniques have been used to uncover the knowledge hidden in such data. Past and current efforts in this area have largely focused on recognizing gene and protein names and on identifying binary relationships among genes or proteins. In this chapter, we present an information extraction system that analyzes publications in Nutritional Genomics, an emerging discipline that studies the interactions among genes, foods and diseases, with the aim of building a quantitative food-disease-gene network. To this end, we adopt a host of techniques, including natural language processing (NLP), domain ontologies, and machine learning.

Specifically, the proposed system is composed of four main modules: (1) named entity recognition, which extracts five types of entities: foods, chemicals, diseases, proteins and genes; (2) relationship extraction, which uses a verb-centric approach to extract binary relationships between two entities; (3) relationship polarity and strength analysis, for which we have constructed novel features that capture the syntactic, semantic and structural aspects of a relationship, with a 2-phase Support Vector Machine classifying the polarity and a Support Vector Regression learner rating the strength level of a relationship; and (4) relationship integration and visualization, which integrates the previously extracted relationships and provides a preliminary user interface for intuitive observation and exploration.
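
To make the idea behind modules (1) and (2) concrete, the following is a minimal sketch of dictionary-based entity tagging followed by verb-centric pairing. It is an illustration only, not the chapter's implementation: the lexicons `ENTITIES` and `RELATION_VERBS`, the `Relation` type, and both function names are hypothetical, and a real system would rely on domain ontologies and syntactic parsing rather than plain string matching.

```python
import re
from typing import NamedTuple

# Hypothetical toy lexicons (assumptions for illustration only).
ENTITIES = {
    "green tea": "food",
    "catechin": "chemical",
    "breast cancer": "disease",
    "brca1": "gene",
}
RELATION_VERBS = {"reduces", "inhibits", "increases", "activates"}


class Relation(NamedTuple):
    subject: str
    subject_type: str
    verb: str
    obj: str
    obj_type: str


def find_entities(sentence: str) -> list[tuple[int, int, str]]:
    """Return (start, end, name) for each lexicon match, sorted by position."""
    lowered = sentence.lower()
    hits = []
    for name in ENTITIES:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            hits.append((match.start(), match.end(), name))
    return sorted(hits)


def extract_relations(sentence: str) -> list[Relation]:
    """Pair neighbouring entities when a known relation verb lies between them."""
    lowered = sentence.lower()
    hits = find_entities(sentence)
    relations = []
    for (_, end_left, left), (start_right, _, right) in zip(hits, hits[1:]):
        # Tokens strictly between the two entity mentions.
        between = re.findall(r"[a-z]+", lowered[end_left:start_right])
        for token in between:
            if token in RELATION_VERBS:
                relations.append(
                    Relation(left, ENTITIES[left], token, right, ENTITIES[right])
                )
                break  # keep only the first verb found for this pair
    return relations
```

Calling `extract_relations("Green tea reduces the risk of breast cancer.")` yields a single food-disease relation anchored on the verb "reduces". The verb is the pivot here: entity pairs with no lexicon verb between them are discarded.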

Empirical evaluations of the first three modules demonstrate the efficacy of this system. The entity recognition module achieved a balanced precision and recall with an average F-score of 0.89. The average F-score of the extracted relationships is 0.905. Finally, accuracies of 0.91 and 0.96 were achieved in classifying the relationship polarity and rating the relationship strength level, respectively.
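
For readers unfamiliar with the metric, the F-score is conventionally the weighted harmonic mean of precision and recall. The sketch below assumes that standard definition (the abstract does not spell out the formula used or how scores were averaged); the function name `f_score` and the input values are illustrative, not figures from the chapter.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was retrieved correctly
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Illustrative inputs only: when precision equals recall, F1 equals both.
balanced = f_score(0.89, 0.89)
# An unbalanced pair is pulled toward the lower of the two values.
unbalanced = f_score(1.0, 0.5)
```

Because the harmonic mean penalizes imbalance, a high F-score can only be reached when precision and recall are both high.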

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Yang, H., Swaminathan, R., Sharma, A., Ketkar, V., D'Silva, J. (2011). Mining Biomedical Text towards Building a Quantitative Food-Disease-Gene Network. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_10

  • DOI: https://doi.org/10.1007/978-3-642-22913-8_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22912-1

  • Online ISBN: 978-3-642-22913-8

  • eBook Packages: Engineering, Engineering (R0)
