Abstract
Advances in biotechnology and the life sciences are producing an ever-increasing volume of published research data, predominantly as unstructured text. Text mining techniques have been used to uncover the knowledge hidden in such data. Past and current efforts in this area have largely focused on recognizing gene and protein names and on identifying binary relationships among genes or proteins. In this chapter, we present an information extraction system that analyzes publications in an emerging discipline, Nutritional Genomics, which studies the interactions among genes, foods and diseases. The aim is to build a quantitative food-disease-gene network. To this end, we adopt a host of techniques, including natural language processing (NLP), domain ontologies, and machine learning.
Specifically, the proposed system is composed of four main modules: (1) named entity recognition, which extracts five types of entities: foods, chemicals, diseases, proteins and genes; (2) relationship extraction, which implements a verb-centric approach to extract binary relationships between two entities; (3) relationship polarity and strength analysis, which uses novel features that capture the syntactic, semantic and structural aspects of a relationship, with a 2-phase Support Vector Machine classifying the polarity and a Support Vector Regression learner rating the strength level; and (4) relationship integration and visualization, which integrates the previously extracted relationships and provides a preliminary user interface for intuitive observation and exploration.
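As a rough illustration of how modules (1) and (2) could fit together, the following is a minimal sketch, not the chapter's implementation. It assumes a tiny hand-written lexicon in place of a domain ontology and a fixed list of trigger verbs; the names `LEXICON`, `VERBS`, `find_entities` and `extract_relations` are hypothetical, and the nearest-entity pairing rule is a simplification chosen only for this example.

```python
import re
from typing import List, NamedTuple, Tuple

# Hypothetical toy lexicon (assumption): surface form -> entity type.
# A real system would draw these from domain ontologies.
LEXICON = {
    "green tea": "food",
    "soy": "food",
    "catechins": "chemical",
    "gastric cancer": "disease",
}

# Hypothetical trigger verbs (assumption) around which relations are built.
VERBS = ("reduces", "increases", "inhibits", "prevents")


class Entity(NamedTuple):
    text: str
    kind: str
    start: int
    end: int


def find_entities(sentence: str) -> List[Entity]:
    """Dictionary lookup of entity mentions, returned in textual order."""
    lowered = sentence.lower()
    found = []
    for term, kind in LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            found.append(Entity(term, kind, match.start(), match.end()))
    return sorted(found, key=lambda e: e.start)


def extract_relations(sentence: str) -> List[Tuple[str, str, str]]:
    """For each trigger verb, pair the nearest entity on its left and right."""
    entities = find_entities(sentence)
    lowered = sentence.lower()
    relations = []
    for verb in VERBS:
        for match in re.finditer(r"\b" + verb + r"\b", lowered):
            left = [e for e in entities if e.end <= match.start()]
            right = [e for e in entities if e.start >= match.end()]
            if left and right:
                relations.append((left[-1].text, verb, right[0].text))
    return relations


if __name__ == "__main__":
    print(extract_relations("Green tea reduces the risk of gastric cancer."))
```

Each extracted triple (entity, verb, entity) is the kind of binary relationship that a later stage could score for polarity and strength; that stage is omitted here because it depends on trained models and features not described in the abstract.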
Empirical evaluations of the first three modules demonstrate the efficacy of this system. The entity recognition module achieved balanced precision and recall, with an average F-score of 0.89. The average F-score of the extracted relationships is 0.905. Finally, accuracies of 0.91 and 0.96 were achieved in classifying the relationship polarity and rating the relationship strength level, respectively.
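For readers unfamiliar with the metric, the F-score combines precision and recall into a single number; with equal weighting it is their harmonic mean. The sketch below shows the standard definition. The function name `f_score` and the example precision and recall values are illustrative assumptions, not figures reported in the chapter.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    print(round(f_score(0.88, 0.90), 2))  # prints 0.89
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a high F-score is only possible when precision and recall are both high, which is why it is commonly reported alongside a claim of balanced precision and recall.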
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Yang, H., Swaminathan, R., Sharma, A., Ketkar, V., D'Silva, J. (2011). Mining Biomedical Text towards Building a Quantitative Food-Disease-Gene Network. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_10
DOI: https://doi.org/10.1007/978-3-642-22913-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22912-1
Online ISBN: 978-3-642-22913-8
eBook Packages: Engineering, Engineering (R0)