Mining Biomedical Text towards Building a Quantitative Food-Disease-Gene Network

Yang, Hui; Swaminathan, Rajesh; Sharma, Abhishek; Ketkar, Vilas; D'Silva, Jason

doi:10.1007/978-3-642-22913-8_10

Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

Advances in biotechnology and the life sciences are producing an ever-increasing volume of published research data, predominantly as unstructured text. Text mining techniques have been used to uncover the knowledge hidden in such data. Past and current efforts in this area have largely focused on recognizing gene and protein names and on identifying binary relationships among genes or proteins. In this chapter, we present an information extraction system that analyzes publications in Nutritional Genomics, an emerging discipline that studies the interactions amongst genes, foods and diseases, with the aim of building a quantitative food-disease-gene network. To this end, we adopt a host of techniques, including natural language processing (NLP), domain ontologies, and machine learning.

Specifically, the proposed system is composed of four main modules: (1) named entity recognition, which extracts five types of entities: foods, chemicals, diseases, proteins and genes; (2) relationship extraction, in which a verb-centric approach is implemented to extract binary relationships between two entities; (3) relationship polarity and strength analysis, for which we have constructed novel features to capture the syntactic, semantic and structural aspects of a relationship, and in which a 2-phase Support Vector Machine classifies the polarity while a Support Vector Regression learner rates the strength level of a relationship; and (4) relationship integration and visualization, which integrates the previously extracted relationships and provides a preliminary user interface for intuitive observation and exploration.
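
As a rough illustration of what the output of modules (1) to (3) might look like when it reaches the integration step in module (4), the following Python sketch models typed entities and scored binary relationships and then merges repeated extractions into network edges. This is only a minimal sketch under stated assumptions: the class names, the polarity labels, the numeric strength scale, and the averaging rule in `integrate` are all hypothetical and are not taken from the chapter, which does not describe its data model or integration rule in this abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The five entity types named in the abstract.
ENTITY_TYPES = {"food", "chemical", "disease", "protein", "gene"}


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # one of ENTITY_TYPES

    def __post_init__(self):
        if self.kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.kind}")


@dataclass(frozen=True)
class Relationship:
    source: Entity
    verb: str        # the verb that links the two entities in a sentence
    target: Entity
    polarity: str    # hypothetical label set, e.g. "positive" / "negative"
    strength: float  # hypothetical numeric strength rating


def integrate(relationships):
    """Merge repeated extractions into one edge per (source, target, polarity).

    Assumption: an edge is summarised by how many extractions support it and
    by the mean of their strength ratings. The chapter may integrate
    relationships differently.
    """
    grouped = defaultdict(list)
    for rel in relationships:
        grouped[(rel.source, rel.target, rel.polarity)].append(rel.strength)
    return {
        key: {"support": len(values), "mean_strength": mean(values)}
        for key, values in grouped.items()
    }


if __name__ == "__main__":
    # Illustrative, made-up extractions (not results from the chapter).
    tea = Entity("green tea", "food")
    cancer = Entity("cancer", "disease")
    extracted = [
        Relationship(tea, "prevents", cancer, "positive", 0.8),
        Relationship(tea, "reduces", cancer, "positive", 0.6),
    ]
    for (src, dst, polarity), stats in integrate(extracted).items():
        print(src.name, "->", dst.name, polarity, stats["support"])
```

Keying each edge on the entity pair and polarity keeps contradictory findings about the same pair visible as separate edges instead of averaging them away, which is one plausible design choice for a quantitative network.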

Empirical evaluations of the first three modules demonstrate the efficacy of this system. The entity recognition module achieved balanced precision and recall, with an average F-score of 0.89. The average F-score of the extracted relationships is 0.905. Finally, accuracies of 0.91 and 0.96 were achieved in classifying the relationship polarity and rating the relationship strength level, respectively.
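
For readers unfamiliar with the metric, the short Python sketch below shows how an F-score combines precision and recall. It assumes the standard balanced F-score (F1, the harmonic mean of the two); the abstract does not state which variant was used, and the input values in the usage line are hypothetical, not figures from the chapter.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was retrieved correctly
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical precision and recall values, chosen only for illustration.
    print(round(f_score(0.90, 0.88), 2))  # prints 0.89
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a high F-score can only be reached when precision and recall are both high, which is why it is reported alongside the remark that the two were balanced.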

Author information

Authors and Affiliations

Department of Computer Science, San Francisco State University, USA
Hui Yang, Rajesh Swaminathan, Abhishek Sharma, Vilas Ketkar & Jason D'Silva

Editor information

Editors and Affiliations

University of New York Tirana, Rr. Komuna E Parisit, Tirana, Albania
Marenglen Biba
Technical University of Catalonia, Campus Nord, Ed. Omega, C/Jordi Girona 1-3, 08034, Barcelona, Spain
Fatos Xhafa

Cite this chapter

Yang, H., Swaminathan, R., Sharma, A., Ketkar, V., D'Silva, J. (2011). Mining Biomedical Text towards Building a Quantitative Food-Disease-Gene Network. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_10

DOI: https://doi.org/10.1007/978-3-642-22913-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22912-1
Online ISBN: 978-3-642-22913-8
eBook Packages: Engineering, Engineering (R0)
