Information Retrieval, Volume 12, Issue 1, pp 81–97

Tasks, topics and relevance judging for the TREC Genomics Track: five years of experience evaluating biomedical text information retrieval systems

  • Phoebe M. Roberts
  • Aaron M. Cohen
  • William R. Hersh


Abstract

With the help of a team of expert biologist judges, the TREC Genomics track generated four large "gold standard" test collections, comprising over a hundred unique topics, two kinds of ad hoc retrieval tasks, and their corresponding relevance judgments. Over the years of the track, increasingly complex tasks necessitated the creation of judging tools and training guidelines, both to accommodate teams of part-time, short-term workers drawn from a variety of specialized biological backgrounds and to address the consistency and reproducibility of the assessment process. Important lessons were learned about the factors that influenced the utility of the test collections, including topic design, annotations provided by judges, methods used to identify and train judges, and the provision of a central moderating "meta-judge".


Keywords: Reference standards · Evaluation · Inter-annotator agreement · Text mining · Information retrieval



Acknowledgments

The TREC Genomics Track was funded by grant ITR-0325160 to W.R.H. from the U.S. National Science Foundation. The authors would like to thank the Genomics track steering committee, especially Kevin Bretonnel Cohen and Anna Divoli, for helpful discussions about relevance judgments and guidelines.



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Phoebe M. Roberts (1)
  • Aaron M. Cohen (2)
  • William R. Hersh (2)

  1. Pfizer Research Technology Center, Cambridge, USA
  2. Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, USA
