Part of the book series: Health Informatics (HI)

Abstract

This chapter explores the indexing process of information retrieval. After some introductory discussion, the two broad approaches to indexing, manual and automated, are described. For manual indexing, approaches applied to bibliographic, full-text, and Web-based content are presented. This is followed by a description of automated approaches to indexing, with discussion limited to those used in operational retrieval systems. The problems associated with each type of indexing are explored. The final section describes computer data structures used to maintain indexing information for efficient retrieval.

Author information

Corresponding author

Correspondence to William Hersh.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Hersh, W. (2020). Indexing. In: Information Retrieval: A Biomedical and Health Perspective. Health Informatics. Springer, Cham. https://doi.org/10.1007/978-3-030-47686-1_4

  • DOI: https://doi.org/10.1007/978-3-030-47686-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-47685-4

  • Online ISBN: 978-3-030-47686-1

  • eBook Packages: Medicine, Medicine (R0)
