A Supervised Machine Learning Approach of Extracting and Ranking Published Papers Describing Coexpression Relationships among Genes

  • Richa Tiwari
  • Chengcui Zhang
  • Thamar Solorio
  • Wei-Bang Chen


In this chapter, we describe a framework that uses supervised machine learning to extract information about coexpression relationships among genes from the published literature, and then ranks the retrieved papers to give users a complete, specialized information retrieval system. We train our classification model with Dynamic Conditional Random Fields (DCRFs). Our approach relies on semantic analysis of the text to classify the predicates that describe coexpression, rather than on detecting the presence of keywords. Our framework outperformed the baseline by almost 52%, with DCRFs showing superior performance to the Bayes Net, SVM, and Naïve Bayes classification algorithms. In our second experiment, a comparison of our ranked results with those of PubMed and Google shows that our proposed model distinguishes a positive paper from a negative paper better than either. In conclusion, this chapter describes a specialized classification and ranking framework that retrieves articles discussing coexpression among genes.


Keywords: Co-expression Relationships; Dynamic Conditional Random Fields (DCRFs); DCRFs Model; Chunk Tags; Biomedical Entities (BME)



Copyright information

© Springer Vienna 2012

Authors and Affiliations

  • Richa Tiwari (1), email author
  • Chengcui Zhang (1)
  • Thamar Solorio (1)
  • Wei-Bang Chen (1)
  1. Department of Computer and Information Sciences, The University of Alabama at Birmingham, Birmingham, USA
