Biomarker Discovery with Text Mining and Literature Based Discovery

Zhu, Fei; Shen, Bairong

doi:10.1007/978-94-007-7975-4_4

Chapter
First Online: 26 November 2013

Part of the book series: Translational Bioinformatics (TRBIO, volume 4)

Abstract

The huge number of biomedical publications provides valuable data for research. However, extracting usable information from these integrated but unstructured biomedical texts is difficult, which calls for biomedical text mining techniques that aim to extract novel knowledge from scientific texts. In this chapter, we introduce the basics of text mining and examine some frequently used algorithms, tools, and data sets. With the development of systems biology, researchers increasingly seek to understand complex biomedical systems from a systems-level viewpoint, so making full use of text mining to facilitate systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in systems biology and each phase of that workflow. Finally, we discuss text mining technology for research on biomarkers.

Notes

1. PubMed. http://www.ncbi.nlm.nih.gov/pubmed/

Author information

Authors and Affiliations

Center for Systems Biology, Soochow University, No. 1. Shizi Street, Suzhou, P.O. Box 206, 215006, Jiangsu, China
Fei Zhu & Bairong Shen

Corresponding author

Correspondence to Bairong Shen.

Editor information

Editors and Affiliations

Center for Systems Biology, Soochow University, Suzhou, People's Republic of China
Bairong Shen

About this chapter

Cite this chapter

Zhu, F., Shen, B. (2013). Biomarker Discovery with Text Mining and Literature Based Discovery. In: Shen, B. (eds) Bioinformatics for Diagnosis, Prognosis and Treatment of Complex Diseases. Translational Bioinformatics, vol 4. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7975-4_4

DOI: https://doi.org/10.1007/978-94-007-7975-4_4
Published: 26 November 2013
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-7974-7
Online ISBN: 978-94-007-7975-4
eBook Packages: Biomedical and Life Sciences; Biomedical and Life Sciences (R0)
