Biomarker Discovery with Text Mining and Literature Based Discovery

Part of the book series: Translational Bioinformatics (TRBIO, volume 4)

Abstract

The vast number of biomedical publications provides valuable data for research. However, extracting usable information from these integrated but unstructured biomedical texts is a difficult problem, one that calls for biomedical text mining techniques that aim to extract novel knowledge from scientific texts. In this chapter, we introduce the basics of text mining and examine some frequently used algorithms, tools, and data sets. With the development of systems biology, researchers increasingly seek to understand complex biomedical systems from a systems-level viewpoint. Thus, making full use of text mining to facilitate systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in systems biology and each phase of that workflow. Finally, we discuss text mining technology for biomarker research.

Author information

Corresponding author

Correspondence to Bairong Shen.

Copyright information

© 2013 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Zhu, F., Shen, B. (2013). Biomarker Discovery with Text Mining and Literature Based Discovery. In: Shen, B. (eds) Bioinformatics for Diagnosis, Prognosis and Treatment of Complex Diseases. Translational Bioinformatics, vol 4. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7975-4_4
