A Supervised Machine Learning Approach of Extracting and Ranking Published Papers Describing Coexpression Relationships among Genes

  • Chapter
Recent Trends in Information Reuse and Integration

Abstract

In this chapter, we describe a framework that uses supervised machine learning to extract information about coexpression relationships among genes from published literature and then ranks the retrieved papers, giving users a complete, specialized information retrieval system. We train our classification model with Dynamic Conditional Random Fields (DCRFs). Our approach relies on semantic analysis of the text to classify the predicates that describe coexpression, rather than on detecting the presence of keywords. Our framework outperformed the baseline by almost 52%, with DCRFs outperforming the Bayes Net, SVM, and Naïve Bayes classification algorithms. In our second experiment, a comparison of our ranked results with those of PubMed and Google shows that our proposed model is better than both at distinguishing a positive paper from a negative paper. In conclusion, this chapter describes a specialized classification and ranking framework that retrieves articles discussing coexpression among genes.

Author information

Corresponding author

Correspondence to Richa Tiwari.

Copyright information

© 2012 Springer Vienna

About this chapter

Cite this chapter

Tiwari, R., Zhang, C., Solorio, T., Chen, WB. (2012). A Supervised Machine Learning Approach of Extracting and Ranking Published Papers Describing Coexpression Relationships among Genes. In: Özyer, T., Kianmehr, K., Tan, M. (eds) Recent Trends in Information Reuse and Integration. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0738-6_14

  • DOI: https://doi.org/10.1007/978-3-7091-0738-6_14

  • Publisher Name: Springer, Vienna

  • Print ISBN: 978-3-7091-0737-9

  • Online ISBN: 978-3-7091-0738-6

  • eBook Packages: Computer Science, Computer Science (R0)
