In the 1950s the clinical data in the medical records of patients in the United States were mostly recorded in natural, English-language, textual form. This was commonly done by physicians when recording their notes for a patient's medical history and physical examination, when reporting their interpretations of x-ray images and electrocardiograms, and when dictating descriptions of medical and surgical procedures. Such patient data were generally recorded by health-care professionals as hand-written notes, or as dictated reports that were then transcribed and typed on paper sheets; these were collated in paper-based charts, and the charts were stored on shelves in the medical record room. The process of manually retrieving data from patients' paper-based medical charts was always cumbersome and time consuming. An additional frequent problem arose when a patient saw more than one physician on the same day in the same medical facility: the patient's paper-based chart was often left in the first doctor's office, and was therefore not available to the other physicians, who then had to see the patient without access to any previously recorded information. Pratt (1974) observed that the data a medical professional recorded and collected during the care of a patient were largely in a non-numeric form, and in the United States were formulated almost exclusively in the English language. He noted that a word, a phrase, or a sentence in this language was generally understood when spoken or read; and the marks of punctuation and the order of the presentation of words in a sentence represented quasi-formal structures that could be analyzed for content according to common rules for: (a) the recognition and validation of the string of language data, which was a matter of morphology and syntax; (b) the recognition and registration of each datum and of its meaning, which was a matter of semantics; and (c) the mapping of the recognized, defined, syntactic and semantic elements into a data structure that reflected the informational content of the original language-data string. He added (d) that these processes required definition and interpretation of the information by the user.

In the 1960s, when computer-stored medical databases began to be developed, it was soon recognized that a very difficult problem was how to process in the computer, in a meaningful way, the large amount of free-form, English-language, textual data that was present in almost every patient's medical record; such data were most commonly recorded in patients' histories, in dictated surgery-operative reports and pathology reports, and in the interpretations of x-rays and electrocardiograms. In some clinical laboratory reports, such as for microbiology, descriptive textual data were often required, and had to be keyed into the computer by the technologist using a full-alphabet keyboard, or entered by selecting codes or names for standard phrases from a menu, using specially designed keyboards or a visually displayed menu (Williams and Williams 1974; Lupovitch et al. 1979; Smith and Svirbely 1988). It was evident that the development of natural language processing (NLP) programs was essential, since textual data: (1) were generally unstandardized and unstructured, (2) were often difficult to interpret, (3) required special computer programs to search and retrieve, and (4) required more storage space than did digital numbers or letters. To help overcome these problems, English-language words and phrases were often converted into numerical codes; and coding procedures were developed to provide more uniform, standardized agreements for terminology, vocabulary, and meaning. These were followed by the development of computer programs for automated encoding methods, and then by special query and retrieval languages for processing textual data. In machine translation of data, the purpose of recognizing the content of an input natural-language string is to reproduce that content accurately in the output language. In information retrieval these tasks involved the categorization and organization of the information content for its use by others in a variety of situations. However, since medical textual data rarely exhibited the well-formed syntax required for fully automatic processing, combined syntactic/semantic language programs needed to be developed.

Natural language processing (NLP) by computers began to evolve in the 1980s as a form of human-computer interaction. There are many spoken languages in the world; this book considers only English-language text, and uses NLP to mean natural (English) language processing. NLP was defined by Obermeier (1987) at Battelle Laboratories, Columbus, OH, as the ability of a computer to process the same language that humans use in their normal discourse. He considered the central problems for NLP to be: (a) how to enter and retrieve uncoded natural-language text; and (b) how to transform a potentially ambiguous textual phrase into an unambiguous form that could be used internally by the computer database. This transformation involved combining words or symbols into a group that could be replaced by a code or by a more general symbol. Different types of parsers evolved that were based on pattern matching, on syntax (grammar), on semantics (meaning), on knowledge bases, or on combinations of these methods. Hendrix and Sacerdoti (1981) at SRI International described the complex nature of NLP as requiring the study of sources of: (1) lexical knowledge, concerned with individual words, the parts of speech to which they belong, and their meanings; (2) syntactic knowledge, concerned with the grouping of words into meaningful phrases; (3) semantic knowledge, concerned with composing the literal meaning of syntactic units from the semantics of their subparts; (4) discourse knowledge, concerned with the way clues from the context being processed are used to interpret a sentence; and (5) domain knowledge, concerned with how medical information constrains possible interpretations.

Clearly NLP had to consider semantics, since medical language is relatively unstandardized, has many ambiguities and ill-defined terms, and often assigns multiple meanings to the same word. Wells (1971) offered as an example of semantically equivalent phrases: muscle atrophy, atrophy of muscle, atrophic muscle, and muscular atrophy. In addition, NLP had to consider syntax, the relation of words to each other in a sentence; for example, when searching for strings of words such as "mitral stenosis and aortic insufficiency", the importance of word order is evident, since the string "mitral insufficiency and aortic stenosis" has a very different meaning. Similarly, the phrase "time flies for house flies" makes sense only when one knows that the word "flies" is first a verb and then a noun. Inconsistent spelling and typographic errors also caused problems for word searches made by a computer program that matched exactly, letter-by-letter. Pryor et al. (1982) also observed that the aggregate of data collected by many different health-care professionals provided the basic information stored in a primary clinical database; to accurately reflect their accumulated experience, all of their observations had to be categorized and recorded in a consistent and standardized manner for all patients' visits. To facilitate the retrieval of desired medical data, Pryor advocated that a clinical database incorporate a coded data-entry format. Johnson et al. (2006) also considered structured data-entry and data-retrieval to be basic tools for computer-assisted documentation that would allow a physician to efficiently select and retrieve from a patient's record all data relevant to the patient's clinical problems; to retrieve supplementary data from other sources that could be helpful in the clinical-decision process; and to enter into the computer any newly acquired data, and then generate a readable report.
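The dependence on word order can be made concrete with a minimal sketch (hypothetical code, not any historical system): a comparison that ignores word order treats the two cardiology phrases above as identical, while a comparison of adjacent word pairs distinguishes them.

```python
# A minimal sketch (hypothetical, not any historical system) showing why
# word order matters when matching clinical phrases.

def bag_of_words(text: str) -> set[str]:
    """Ignore order: return the set of words in the phrase."""
    return set(text.lower().split())

def bigrams(text: str) -> set[tuple[str, str]]:
    """Preserve local order: return the set of adjacent word pairs."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

a = "mitral stenosis and aortic insufficiency"
b = "mitral insufficiency and aortic stenosis"

print(bag_of_words(a) == bag_of_words(b))  # True  -- order ignored, meanings conflated
print(bigrams(a) == bigrams(b))            # False -- word order separates the findings
```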

McCray (1987, 1998) at the National Library of Medicine (NLM) described the medical lexicon as the embodiment of information about medical terms and language, serving as the foundation for natural language processing (NLP). McCray proposed that domain knowledge, combined with lexical information and sophisticated linguistic analysis, could lead to improved representation and retrieval of biomedical information and facilitate the development of NLP. McCray et al. (2001) studied the nature of strings of words found in the NLM's UMLS Metathesaurus (see Sect. 9.1.1), and studied their usefulness in searching articles in the NLM's MEDLINE database. Their studies indicated that the longer the string of words (for example, more than four words), the less likely it was to be found in the body of the text, and therefore the less likely to be useful in natural language processing. Grams and Jin (1989) reviewed the design specifications for databases that stored natural language text (including graphs, images, and other forms of non-digital information collected from reference sources such as journals and textbooks), and that could display the requested information in a user-friendly, natural-language format. R. Grams concluded that such a database required a companion metadatabase that defined terms and provided a thesaurus for data acquired from different sources. Friedman and Hripcsak (1999), after many years of work developing a natural language processing (NLP) system, concluded that although encoded medical data were necessary for accurate retrieval, much of the data in patients' records was recorded in a textual form that was extremely diverse, the meanings of words varied depending on their context, and the patients' records were usually not readily retrievable. So efficient NLP systems were essential for processing textual data; but these systems were very difficult to develop, and they required substantial amounts of relevant knowledge for each clinical domain in which they were employed.

3.1 The Development of Standard Terminologies and Codes

Medical terminologies are systematized collections of terms used in medicine to assist a person in communicating with a computer; they require developing and using standard definitions of: (1) terms, which are units of formal language such as words or numbers; (2) entities, which are units of reality, such as human body sites, population groups, or components of a system or of an organization, such as the radiology department in a hospital; (3) codes, which are units of partition: groups of words, letters, numbers, or symbols that represent specific items such as medical diagnoses or procedures; (4) nominal phrases, which are units of natural language; and (5) concepts, which are representations of thoughts formed in the mind: mental constructs or representations of combined things, objects, or thoughts (Olson et al. 1995; Tuttle et al. 1995).

Ozbolt et al. (1995) reported testing manual auditors for the reliability and validity of their coding of standard terms collected from a set of 465 patients' medical-care records submitted by nine hospitals. The manual auditors identified almost 19,000 items in these patients' records as representing statements of patients' medical problems, patients' outcomes from care, and patient-care problems; and they found that their set of standard terms and codes matched 99.1% of these items. They concluded that this was a useful demonstration that medical terminologies could meet criteria for acceptable accuracy in coding, and that computer-based terminologies could be a useful part of a medical language system. Hogan and Wagner (1996) evaluated allowing health-care practitioners to add free-text information to supplement coded information and to provide more flexibility during their direct entry of medications. They found that the added free-text data often changed the meaning of coded data and lowered data accuracy for the medical decision-support system used with their electronic medical records (EMRs). Chute (1998) reviewed in some detail the evolution of the healthcare terminologies basic to medical data-encoding systems, a history that goes back several centuries. Current terminologies and methods for encoding medical diagnoses began in the 1940s with the World Health Organization (WHO), which undertook the classifying and codifying of diseases by systematic assignment of related diagnostic terms to classes or groups. The WHO took over from the French the classification system they had adopted in 1893, which was based primarily on body site and etiology of diseases (Feinstein 1988).

The Medical Subject Headings (MeSH) vocabulary file was initiated in 1960 by the National Library of Medicine (NLM) to standardize its indexing of medical terms and to facilitate the use of its search and retrieval programs. MeSH was developed primarily for use by librarians in indexing the NLM's stored literature citations, and was NLM's way of meeting the problem of variances in medical terminology by instituting its own standard, controlled vocabulary. However, MeSH was not designed to serve as a vocabulary for the data in patients' medical records. MeSH is a highly structured thesaurus consisting of a standard set of terms and subject headings that are arranged in both an alphabetical and a categorical structure, with categories further subdivided into subcategories; within each subcategory the descriptors are arranged hierarchically. MeSH is the NLM's authority list of technical terms used for indexing biomedical journal articles, cataloging books, and bibliographic search of the NLM's computer-based citation file (see also Sect. 9.1).

The International Classification of Diseases (ICD), published under WHO sponsorship, was in its sixth revision in 1948. In the 1950s medical librarians manually encoded diagnoses with ICD-6 codes. In the 1960s ICD-7 codes were generally keypunched into cards for electronic data processing. The International Classification of Diseases, Adapted (ICDA), used in the United States for indexing hospital records, was based on ICD-8, which was published in 1967. Beginning in 1968 the ICDA served as the basis for coding diagnosis data for official morbidity and mortality statistics in the United States. In addition, the payers of insurance claims began to require ICDA codes for payments, which encouraged hospitals to enter into their computers the patients' discharge diagnoses with their appropriate ICDA codes. The ninth revision, ICD-9, appeared in 1977; since the ICD was originally designed as an international system for reporting causes of death, ICD-9 was revised to better classify diseases. In 1978 its Clinical Modification (ICD-9-CM) included more than 10,000 terms and permitted six-digit codes plus modifiers; ICD-9-CM also included in its Volume III a listing of procedures. Throughout the three decades of the 1980s, 1990s, and 2000s, ICD-9-CM was the nationwide classification system used by medical record librarians and physicians for the coding of diagnoses. The final versions of the ICD-9 codes were released in 2010 (CMS-2010); the ICD-10 codes were scheduled to appear in 2011.

Chute (2010) noted that the 1996 Health Insurance Portability and Accountability Act (HIPAA) was the first time in legislative history that the healthcare industry was subjected to a mandate for data-exchange standards, such as the required use of International Classification of Diseases (ICD) codes. HIPAA gave the National Committee for Vital and Health Statistics (NCVHS) the authority to oversee health-information exchange standards, and NCVHS became the first designated committee for health information technology (HIT) standards.

The Standard Nomenclature of Diseases and Operations (SNDO), a compilation of standard medical terms organized by their meaning or by some logical relationship, such as by diseases or operations, was developed by the New York Academy of Medicine and published by the American Medical Association in 1933; it was used in most hospitals in the United States for three decades. SNDO listed medical conditions in two dimensions: (1) by anatomic site or topographic category (for example: body as a whole, skin, respiratory, cardiovascular, and so forth); and (2) by etiology or cause (for example: due to prenatal influence, due to plant or parasite, due to intoxication, due to trauma by physical agent, and so forth). The two-dimensional SNDO was not sufficiently flexible to satisfy clinical needs, and its last (fifth) edition was published in 1961.

Current Medical Terminology (CMT) was an important early contribution to the standardization of medical terminology, made by Gordon (1965) and a committee of the American Medical Association, who developed an alphabetical listing of terms with their definitions and simplified references. The first edition of CMT was published in 1962, with revisions in 1964 and 1965 (Gordon 1968).

Current Medical Information and Terminology (CMIT) was an expanded version of CMT, published in 1971, that provided a distillate of the medical record by using four-digit codes for descriptors such as symptoms, signs, laboratory test results, and x-ray and pathology reports (Gordon 1970, 1973). CMIT also defined its diagnostic terms, addressing a common deficiency of SNOP, SNOMED, and ICD, all of which lacked a common dictionary that precisely defined their terms; as a result, the same condition could be defined differently in each system and be assigned different codes by different coders (Henkind et al. 1986). An important benefit of using a common dictionary was to encourage the standardization of medical terms through their definitions, and thereby facilitate the interchange of medical information among different health professionals and among different medical databases. Since the data stored in patients' records came from multiple sub-system databases, such as pathology, laboratory, pharmacy, and others, some standards for exchanging data had to be established before the data could be readily transferred into a computer-based, integrated patient record. Since CMIT was available in machine-readable form, it was an excellent source of structured information for more than 3,000 diseases; Lindberg et al. (1968b) therefore used it as a computer aid to making a diagnosis in his CONSIDER program, which searched CMIT by combinations of disease attributes and then listed the diseases in which these attributes occurred.

Current Procedural Terminology (CPT) was first published in 1967 with a four-digit coding system for identifying medical procedures and services, primarily for the payment of medical claims; it was soon revised and expanded to five-digit codes to facilitate the frequent addition of new procedures (Farrington 1978). Subsequently, the American Medical Association provided frequent revisions of CPT; and in the 1970s and 1980s CPT-4 was the most widely accepted system of standardized descriptive terms and codes for reporting physician-provided procedures and services under government and private health-insurance programs. In 1989 the Health Care Financing Administration (HCFA) began to require every physician's claim for payment of services provided to patients seen in medical offices to include ICD-9-CM code numbers for diagnoses, and to report CPT-4 codes for procedures and services (Roper et al. 1988; Roper 1989).

The Systematized Nomenclature of Pathology (SNOP), a four-dimensional nomenclature intended primarily for use by pathologists, was developed by a group within the College of American Pathologists led by A. Wells, and was first published in 1965. SNOP coded medical terms into four TMEF categories: (1) Topography (T) for the body site affected, (2) Morphology (M) for the structural changes observed, (3) Etiology (E) for the cause of the disease, and (4) Function (F) for the abnormal changes in physiology (Wells 1971). Thus a patient with lung cancer who smoked cigarettes and had episodes of shortness of breath at night would be assigned the following string of SNOP terms: T2600M8103 (bronchus, carcinoma); E6927 (tobacco-cigarettes); F7103 (paroxysmal nocturnal dyspnea) (Pratt 1973). Complete, as well as multiple, TMEF statements were considered necessary for pathologists' purposes (Graepel et al. 1975).
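The TMEF scheme can be illustrated with a short sketch built from the lung-cancer example above; the Python representation and field layout are illustrative assumptions, while the axis codes themselves are those quoted from Pratt (1973).

```python
# A sketch of a SNOP four-axis (TMEF) coded statement for the lung-cancer
# example in the text; the data structure is illustrative, the codes are
# those given by Pratt (1973).

snop_statement = {
    "T": ("2600", "bronchus"),                      # Topography: body site
    "M": ("8103", "carcinoma"),                     # Morphology: structural change
    "E": ("6927", "tobacco-cigarettes"),            # Etiology: cause
    "F": ("7103", "paroxysmal nocturnal dyspnea"),  # Function: physiologic change
}

# Render the coded string in the concatenated form shown in the text.
coded = "".join(axis + code for axis, (code, _) in snop_statement.items())
print(coded)  # T2600M8103E6927F7103
```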

The result of these applications was the translation of medical text into the four fields (T, M, E, and F) as listed in the SNOP dictionary. In the 1960s SNOP was generally adopted by pathologists, as it was well suited for coding data for computer entry using punched cards; and in the 1970s it was the basis for the development of computer programs to permit automatic SNOP encoding of pathology terms (Pratt 1971, 1973, 1974). The successful use of SNOP by pathologists encouraged R. Cote, G. Gantner, and others to expand SNOP in an attempt to encompass all medical specialties.

The Systematized Nomenclature of Medicine (SNOMED) was first published in 1977 (SNOMED 1977). In addition to SNOP's four fields of Topography (T), Morphology (M), Etiology (E), and Function (F), SNOMED contained three more fields: (1) Disease (D) for classes of diseases, complex disease entities, and syndromes, which made SNOMED as suitable for statistical reporting as the ICD; (2) Procedure (P) for diagnostic, therapeutic, preventive, or administrative procedures; and (3) Occupation (O) for the patient's occupational and industrial hazards (Cote 1977, 1986; Gantner 1980). Some reports compared SNOMED and ICD, and advocated SNOMED as superior for the purposes of medical care and clinical research, since ICD was designed primarily for statistical reporting and its codes were often too general to identify specific patient problems. In addition, SNOMED defined the logical connections between the categories of data contained in the final coded statement; and SNOMED codes could be used to generate ICD codes, but not vice versa (Graepel 1976).

The Systematized Nomenclature of Human and Veterinary Medicine (SNOMED-International) was reported by Lussier et al. (1998) to have been under development since the 1970s; SNOMED-International (version 2) had appeared in 1979. Rothwell and Cote (1990) reported that SNOMED-International (version 3) was more modular and systematized, and contained linkages among terms, so that it could serve as a conceptual framework for the representation of medical knowledge and could also support NLP. Rothwell and Cote (1996) further described SNOMED International as having the objective of providing a robust, controlled vocabulary of medical terms and concepts that encompassed the entire domains of human and veterinary medicine. In 1996 SNOMED International (version 3.3) used 11 primary term codes: Topography (T); Morphology (M); Function (F); Living organisms (L); Chemicals, drugs and biological products (C); Physical agents, forces and activities (A); Occupations (J); Social context (S); Disease/Diagnosis (D); Procedures (P); and General linkage/modifiers (G). Mullins et al. (1996) compared the level of match when using three clinical vocabularies (SNOMED International, Read Codes, and NLM's UMLS) for coding 144 progress notes in a group of ambulatory, family-practice clinical records. They reported significant differences in the level of match among the three coding systems: SNOMED performed at the highest level of good matches, UMLS next, and Read at the lowest level; and they recommended additional studies to better standardize coding procedures. Campbell et al. (1998) tested a version of SNOMED-International at several large medical centers, and concluded that it could adequately reconcile different database designs and efficiently disseminate updates that were tailored for locally enhanced terminologies.

The Systematized Nomenclature of Medicine Reference Terminology (SNOMED-RT) was also developed by the College of American Pathologists (CAP), to serve as a common reference terminology for the aggregation and retrieval of health-care information recorded by multiple individuals and organizations (Stearns et al. 2001). Dolin et al. (2001) described the SNOMED-RT Procedure Model as providing an advanced hierarchical structure, with poly-hierarchies representing supertype and subtype relationships, that included clinical actions and healthcare services such as surgical and invasive procedures, courses of therapy, history taking, physical examinations, tests of all kinds, monitoring, and administrative and financial services. SNOMED Clinical Terms (SNOMED-CT) was developed beginning in 1999, when similarities were recognized between SNOMED-RT and Clinical Terms Version 3 (CTV3), which the National Health Service of the United Kingdom had developed as an evolution of the Read Codes. Spackman (2005) reported on three years' use of this clinical terminology, and described changes in SNOMED-CT that included removing duplicate terms, improving logic definitions, and revising conceptual relationships.

Problems with inconsistencies among the various medical terminologies soon became apparent. Ward et al. (1996) described the need for associations of health-care organizations to be able to maintain a common database of uniformly coded health-outcomes data; and reported the development of the Health Outcomes Institute (HOI) with its uniquely coded medical-data elements. In 2004 the National Health Information Infrastructure (NHII) initiative was begun in an attempt to standardize information for patients' electronic medical records (EMRs); it recommended as the standard terminologies for EMRs the Systematized Nomenclature of Medicine (SNOMED) and the Logical Observation Identifiers Names and Codes (LOINC). The National Cancer Institute (NCI) developed the Common Data Elements (CDEs) to define the data required for research in oncology (Niland et al. 2006). The convergence of medical terminologies became an essential requirement for linking multiple databases from different sources that used different coding terminologies. In 1987 the National Library of Medicine (NLM) initiated the development of a convergent medical terminology with its Unified Medical Language System (UMLS), which included a Semantic Network of interrelated semantic classes, and a Metathesaurus of interrelated concepts and names that supported linking data from multiple sources. UMLS attempted to compensate for differences in terminology among different systems such as MeSH, CMIT, SNOP, SNOMED, and ICD. UMLS was not planned to form a single convergent vocabulary, but rather to unify terms from a variety of standardized vocabularies and codes for the purpose of improving bibliographic literature retrieval, and to provide standardized data terms for computer-based information. Humphreys (1989, 1990) at NLM described UMLS as a major NLM initiative designed to facilitate the retrieval and integration of information from many machine-readable information sources, including the biomedical literature, factual databases, and knowledge bases (see also Sect. 9.1.1). Cimino and Barnett (1990) studied the problem of translating medical terms among four different controlled terminologies: NLM's MeSH, the International Classification of Diseases (ICD-9), Current Procedural Terminology (CPT-4), and the Systematized Nomenclature of Medicine (SNOMED). When a user needed to translate a free-text term from one terminology to another, the free-text term was entered into one system, which then presented its list of controlled terms, and the user selected the most nearly correct term; if the user did not recognize any of the presented terms as a correct translation, the user could try again. It was recognized that an automatic translation process would be preferable for the conversion of terms from one system to another. They created a set of rules to construct a standard way of representing a medical term that denoted the semantic features of the term by establishing it as an instance of a class, or more specifically of a subclass that inherited all of the required properties. They developed an algorithm that compared matches for a subset of terms in the category of "procedures", and reported that matches from ICD-9 to the other terminologies appeared to be "good" 45% of the time; when a match was "suboptimal" (55% of the time), the reason was that ICD-9 did not contain an appropriate matching term. They concluded that the development of a common terminology would be desirable.
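The general idea of translating through a shared, class-based semantic representation might be sketched as follows; the code tables, the (class, body site) representation, and the matching rule are invented for illustration and are not Cimino and Barnett's actual algorithm.

```python
# An illustrative sketch of mapping a term between two terminologies through
# a shared semantic representation; all tables and codes here are examples.

# Each terminology maps its own codes to a shared (class, body_site) form.
ICD9_PROCEDURES = {"45.23": ("endoscopy", "colon")}
CPT4_PROCEDURES = {"45378": ("endoscopy", "colon"),
                   "43235": ("endoscopy", "stomach")}

def translate(icd9_code: str) -> list[str]:
    """Find CPT-4 codes whose semantic representation matches the ICD-9 term."""
    semantics = ICD9_PROCEDURES[icd9_code]
    return [cpt for cpt, sem in CPT4_PROCEDURES.items() if sem == semantics]

print(translate("45.23"))  # ['45378'] -- matched via the shared representation
```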

Cimino (1994) and associates at Columbia University also addressed some of the inconsistencies of terms in different terminologies, and emphasized the necessity for a controlled, common medical terminology capable of linking and converging data from medical applications in different hospital departmental services, from different patient-record systems, from knowledge-based systems, and from medical literature databases. They proposed as criteria for a controlled medical terminology: (1) domain completeness, so that it did not restrict the depth or breadth of the hierarchy; (2) nonredundancy, to prevent multiple terms being added for the same concept; (3) synonymy, to support multiple non-unique names for concepts; (4) nonvagueness, in that each concept must be complete in its meaning; (5) nonambiguity, in that each concept must have exactly one meaning; (6) multiple classification, so that a concept can be assigned to as many classes as required; (7) consistency of views, in that concepts in multiple classes must have the same attributes in each class; and (8) explicit relationships, in that the meanings of inter-concept relationships must be clear. Cimino (1998) further added as desirable that controlled medical vocabularies should provide expandable vocabulary content, be able to add new terms quickly as they arise, and change with the evolution of medical knowledge; that the unit of symbolic processing should be the concept, the embodiment of a particular meaning; that vocabulary terms must correspond to only one meaning, and meanings must correspond to only one term; that the meaning of a concept must be permanent, although its name can change when, for example, a newer version of the vocabulary is developed; that controlled medical vocabularies should have hierarchical structures, and although a single hierarchy is more manageable, polyhierarchies may be allowed; that multipurpose vocabularies may require different levels of granularity; and that synonyms of terms should be allowed, but redundancy, such as multiple ways to code a term, should be avoided. Cimino (1994, 1995, 1998) applied these criteria for a convergent terminology to the Medical Entities Dictionary (MED) that they developed for their centralized clinical information system at Columbia University. MED included subclassification systems for their ancillary clinical services, including the clinical laboratory, pharmacy, and electrocardiography. MED was a MUMPS-based, hierarchical data structure, with a vocabulary browser and a knowledge base. Since the classes of data provided within their ancillary systems were inadequate for the MED hierarchy, both for the multiple-classification criterion and for use in clinical applications, a subclassification function was added to create new classes of concepts. By the mid-1990s MED contained 32,767 concepts; and it had encoded six million procedures and 48 million test results for more than 300,000 patients. Mays (1996) and associates at the IBM T. J. Watson Research Center in Yorktown Heights, New York, described their K-Rep system, based on description logic (DL), which considered its principal objects of representation to be concepts, such as laboratory tests, diagnostic procedures, and others; concepts could include sub-concepts (for example, the concept of a chemistry test could include the sub-concept of a serum sodium test), thereby enabling an increased scalability of concepts.
They considered conceptual scalability to be an enhancement of system scalability; and their strategy allowed multiple developers to work concurrently on overlapping portions of the terminology in independent databases. Oliver et al. (1995, 1996) reported the formation of the InterMed Collaboratory, a group of medical informaticians with experience in medical terminology whose objective was to develop a common model for controlled medical vocabularies.
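Several of these ideas (permanent concept identifiers, synonyms as non-unique names, and multiple classification in a polyhierarchy) can be sketched in a few lines; the serum-sodium example follows the text, but the data structure itself is a hypothetical illustration, not MED or K-Rep.

```python
# Illustrative sketch of a controlled-vocabulary concept satisfying several
# of Cimino's criteria: a permanent concept id with exactly one meaning,
# synonyms as non-unique names, and multiple parents (a polyhierarchy).

from dataclasses import dataclass, field

@dataclass
class Concept:
    concept_id: int              # permanent; its meaning never changes
    preferred_name: str          # the name may change across versions
    synonyms: list[str] = field(default_factory=list)
    parents: list["Concept"] = field(default_factory=list)  # multiple classification

    def ancestors(self) -> set[int]:
        """Walk all parent paths of the polyhierarchy."""
        result: set[int] = set()
        for p in self.parents:
            result.add(p.concept_id)
            result |= p.ancestors()
        return result

# Hypothetical example: a serum sodium test classified both as a chemistry
# test and as an electrolyte measurement.
lab_test    = Concept(100, "laboratory test")
chem_test   = Concept(110, "chemistry test", parents=[lab_test])
electrolyte = Concept(120, "electrolyte measurement", parents=[lab_test])
serum_na    = Concept(111, "serum sodium test",
                      synonyms=["Na, serum"], parents=[chem_test, electrolyte])

print(sorted(serum_na.ancestors()))  # [100, 110, 120]
```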

Convergent Medical Terminology (CMT) was developed by a group from Kaiser Permanente, the Mayo Clinic, and Stanford University, who addressed the objective of achieving a convergence of several different existing terminologies in order to better support the development of informatics applications and to facilitate the exchange of data between systems using different terminologies. They had found that some medical terminologies, such as SNOMED International and ICD-9-CM, used a hierarchical structure that organized the concepts into type hierarchies that were limiting, since they lacked formal definitions for the terms in the systems, and did not sufficiently define what a term represented nor how one term differed from another (Campbell et al. 1996). Building on the experience with the K-Rep system described by Mays et al. (1996), they developed a convergent medical terminology they called Galapagos, which could take a collection of applications from multiple sites, identify and reconcile conflicting designs, and also develop updates tailored specifically for compatibility with locally enhanced terminologies. Campbell et al. (1998) further reported their applications of Galapagos for concurrent evolutionary enhancements of SNOMED International at three Kaiser Permanente (KP) regions and at the Mayo Clinic. They found their design objectives had been met: Galapagos supported semantic-based concurrency control, and identified and resolved conflicting design decisions. Dolin (2004) and associates at KP described the Convergent Medical Terminology (CMT) as having a core comprising SNOMED-CT, laboratory LOINC, and First DataBank drug terminology, all integrated into a poly-hierarchically structured knowledge base of concepts with logic-based definitions imported from the source terminologies. In 2004 CMT was implemented enterprise-wide in KP, and served as the common terminology across all KP computer-based applications for its 8.4 million members in the United States. CMT served as the definitive source of concept definitions for the KP organization; it provided a consistent structure and access method for all computer codes used by KP, with inter-operability and cross-mappings to all KP ancillary subsystems. In 2010 KP donated the CMT to the National Library of Medicine for free public access.

Chute et al. (1999) introduced the notion of a terminology server that would mediate translations among concepts shared across disparate terminologies. They had observed that a major problem with a clinical terminology server used by clinicians to enter patient data from different clinical services was that clinicians were prone to use lexical variants of words that might not match the corresponding representations within the nomenclature. Chute added as desirable requirements for a convergent medical terminology: (1) word normalization, by a normalization and lexical-variant generator that replaced clinical jargon and completed abbreviated words and terms; (2) target terminology specifications, for supporting other terminologies used by the enterprise, such as SNOMED-RT or ICD-9-CM; (3) spell-checking and correction; (4) lexical matching of words against a library of indexed words; (5) semantic locality, by making visible closely related terms or concepts; (6) term composition, which brought together modifiers or qualifiers and a kernel concept; and (7) term decomposition, which broke apart complex phrases into atomic components.
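A minimal sketch of the first few requirements (word normalization, spell-tolerant matching, and lexical matching against an indexed vocabulary) is shown below; the abbreviation table, vocabulary, and similarity cutoff are invented examples, not Chute's implementation.

```python
# A minimal sketch of word normalization and lexical matching; the
# abbreviation table and vocabulary are hypothetical examples.

import difflib

ABBREVIATIONS = {"fx": "fracture", "hx": "history", "abd": "abdominal"}
VOCABULARY = ["fracture of femur", "abdominal pain", "history of asthma"]

def normalize(term: str) -> str:
    """Lowercase, expand abbreviations, and normalize whitespace."""
    words = term.lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def lexical_match(term: str, cutoff: float = 0.7) -> list[str]:
    """Spell-tolerant match of the normalized term against the vocabulary."""
    return difflib.get_close_matches(normalize(term), VOCABULARY, cutoff=cutoff)

print(lexical_match("Fx of femur"))  # ['fracture of femur']
print(lexical_match("abd pian"))     # misspelling still finds 'abdominal pain'
```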

3.2 Encoding Textual Medical Data

Encoding text greatly simplified the search and retrieval of textual data, which otherwise had to be done by matching letters and numbers; when English-language terms were represented by numerical codes, the textual data could be entered into the computer in a readable, compact, and consistent format. The disadvantages of encoding natural-language terms were that users had to be familiar with the coding system; that codes tended to reduce the flexibility and richness of textual data and to stereotype the information; and that codes required updating and revision for new terms, or they could become obsolete (Robinson 1974, 1978). Yet the process of coding was an important early method used for natural language processing (NLP); and manual encoding methods often used special-purpose, structured, pre-coded data-entry forms. It soon became evident that efficient NLP systems needed standardized terminology and rules for coding, aggregating, and communicating textual data; and they needed automated encoding methods.

Automated encoding of textual data by computer became an important goal, since the manual coding of text was a tedious and time-consuming process that led to inconsistent coding; so efforts were soon directed to developing NLP software for automatic encoding by computer. Bishop (1989) defined its requirements to be: a unique code for each term (word or phrase); a definition for each code; independence of each term; synonyms equated to the codes of their base terms; the ability to link each code to the codes of related terms; a system that encompassed all of medicine and was in the public domain; and a knowledge-base format described completely in functional terms, to make it independent of the software and hardware used. It was also apparent that the formalized structuring and encoding of standardized medical terms would provide a great savings in storage space and would improve the effectiveness of the search and retrieval process for textual data. Automated data encoding, as the alternative to manual encoding, needed to capture the data electronically as they occurred naturally in clinical practice, and then have a computer do the encoding. Tatch (1964), in the Surgeon General's Office of the U.S. Army, reported automatically encoding diagnoses by punching paper tape as a by-product of the normal typing of the clinical record summary sheet. The computer program operated upon the actual words within selected blocks, one word at a time, and translated each letter in a word into a unique numeral; the numeral was matched to an identification table, and an identity code was appended to the numeral. Based on a syntax code, the numerals were added one at a time until a diagnostic classification was determined. The diagnostic code related to the final sum was retrieved from computer memory and added to the clinical record summary.
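Tatch's letter-to-numeral scheme might be loosely illustrated as follows; the letter values, identification table, and summation rule below are invented stand-ins for the unspecified details of the original system.

```python
# A loose illustration of the word-to-numeral encoding described above;
# the letter values, table, and codes here are invented for the example.

def word_to_numeral(word: str) -> int:
    """Translate each letter to a number (a=1..z=26) and combine positionally."""
    n = 0
    for ch in word.lower():
        if ch.isalpha():
            n = n * 100 + (ord(ch) - ord("a") + 1)
    return n

# Hypothetical identification table mapping word-numerals to identity codes.
IDENTIFICATION_TABLE = {
    word_to_numeral("acute"): 3,
    word_to_numeral("appendicitis"): 41,
}

def classify(diagnosis: str) -> int:
    """Sum the identity codes of recognized words to reach a classification."""
    return sum(IDENTIFICATION_TABLE.get(word_to_numeral(w), 0)
               for w in diagnosis.split())

print(classify("acute appendicitis"))  # 44 -> looked up as a diagnostic code
```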

Pratt (1975) at the National Institutes of Health (NIH) reported the automated encoding of autopsy diagnoses using the Systematized Nomenclature of Pathology (SNOP). He noted that in the creation of a computer-based, natural language processing (NLP) system, it was necessary to provide for the morphological, syntactic, and semantic recognition of the input data. He used SNOP as his semantically organized dictionary, and noted that SNOP was divided into four major semantic categories: Topography (T), Morphology (M), Etiology (E), and Function (F). He further defined additional semantic subcategories and morphemes (the smallest meaningful parts of words) to permit the successful identification of word forms that were not found in the SNOP dictionary, and to help in the recognition of medical synonyms. He developed parsing algorithms for morphological, syntactic, and semantic analyses of autopsy diagnoses; and he developed a computer program which, when given as input a body of medical text, produced as output a linguistic description and semantic interpretation of the given text (Pratt and Pacak 1969a, b). Whiting-O'Keefe (1983) and associates at the University of California, San Francisco, reported a system that automatically encoded patients' data from their medical records. A computer program was developed that extracted partially encoded patient data that had been gathered by the Summary Time Oriented Record (STOR) system for ambulatory patients, and converted it to fully encoded data. The primary display of the STOR system was a time-sequenced flow sheet. Much of the data captured was structured, which could be viewed as a form of partial data coding, and this made the automated-encoding system feasible. Their coding program allowed a user to develop a set of coding specifications that determined what data in the STOR database were to be coded, and how. In July 1983 the first machine-encoded data were passed from the STOR system to the ARAMIS database (see also Sect. 5.3).

Demuth (1985) described the earliest approaches that had been used to develop automated data-encoding systems. These included: (1) language-based systems, which matched English words against a dictionary and, if a match or an accepted synonym was found, assigned a code; and (2) knowledge-based or expert systems, which included the domain of knowledge recorded by experts for whom the particular data system was intended, and attempted to mimic the reasoning and logic of the users. Hierarchical, tree-based decision systems tried to automate human reasoning and logic by using simple queries and responses; the decision-tree design mandated the nature and order of the questions to be asked, and how they were to be answered. Demuth concluded that an automated coding system had to possess characteristics of both a language-based and a knowledge-based system in order to provide the feedback necessary to help a medical-records professional arrive at the correct codes. Gabrieli (1987) developed an office information system called "Physicians' Records and Knowledge Yielding Total-Information for Consulting Electronically" (PRAKTICE) for processing the natural language text in medical records (Gabrieli 1984). Gabrieli developed a computer-compatible medical nomenclature with a numeric representation, in which the location of a term in a hierarchical tree served as its code. For example, the diagnosis of polycythemia was represented by 4-5-9-1-2, where 4 = clinical medicine, 4-5 = a diagnostic term, 4-5-9 = hematologic diagnostic term, 4-5-9-1 = red cell disorder, and 4-5-9-1-2 = polycythemia. He also developed a lexicon that contained more than 100,000 terms. He used his system for processing medical text, and described his method as beginning with a parser that recognized punctuation marks and spaces, and then broke each sentence down into individual words while retaining the whole sentence intact for reference. Each word was numbered for its place in the sentence, matched against his word lexicon, and given a grammatical classification (noun, verb, etc.) and a semantic characterization (grouped among "clue" medical words, modifiers, or others). The program then looked for any words near a medical term that might be modifiers altering its meaning (usually adjectives); thus the term "abdominal pain" might carry a preceding modifier, as in "crampy abdominal pain". The remaining words were then analyzed for their relationships to the other words in the sentence. Powsner (1987) and associates at Yale University reported on their use of semantic relationships between terms, linking pairs of related terms to try to improve the coding and retrieval of clinical literature. They found that defining semantic relationships for certain pairs of terms could be helpful; but multiple semantic relationships could occur in the clinical literature, and these were strongly dependent upon the clinical specialty. In the 1990s and the 2000s more advanced NLP systems were developed for both the automated encoding and the automated querying of uncoded textual medical data (see Sect. 3.3).
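Gabrieli's use of tree locations as codes can be sketched directly from the polycythemia example; the dictionary below holds only that single path, and the lookup function is an illustrative assumption.

```python
# A sketch of a hierarchical numeric nomenclature in which the path from the
# root is the code, using the polycythemia example from the text.

GABRIELI_TREE = {
    "4": "clinical medicine",
    "4-5": "diagnostic term",
    "4-5-9": "hematologic diagnostic term",
    "4-5-9-1": "red cell disorder",
    "4-5-9-1-2": "polycythemia",
}

def expand(code: str) -> list[str]:
    """Read the meaning of a code from the root of the hierarchy down."""
    parts = code.split("-")
    prefixes = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [GABRIELI_TREE[p] for p in prefixes]

print(expand("4-5-9-1-2"))
# ['clinical medicine', 'diagnostic term', 'hematologic diagnostic term',
#  'red cell disorder', 'polycythemia']
```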

3.3 Querying Textual Medical Data

The approaches to automatic encoding of textual data led to the development of methods for the automated retrieval of encoded textual data, and then for the much more difficult process of automated retrieval of uncoded textual data. The earliest retrieval of stored uncoded textual data, by the matching of words and phrases within the text, such as was used for a key-word-in-context (KWIC) search (Kent 1966), led to the pattern matching of word strings (Yianilos 1978). Early automated query systems attempted to match a word with a similar word in their own data dictionary or lexicon; if no direct match was found, the system then searched for a synonym listed in the lexicon that could be accepted by the user. Ideally what was needed was a natural-language processing (NLP) system that allowed the user to interact with the computer in English-language text. Certainly the fluent use of the English language was markedly different from structured computer languages. Computers readily surpassed humans at processing strings of numbers or letters; however, people found it more effective to communicate using strings of words and phrases. The approach of matching words and phrases was useful for processing some highly structured uncoded text; however, this method still ignored the syntax of sentences, and thereby missed the importance of the locations of words within a sentence and of the relations between words.
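A key-word-in-context search can be sketched in a few lines: every significant word is indexed together with its surrounding words, so a query shows each hit in its context. The stop-word list, context window, and sample report below are invented for the example.

```python
# A minimal key-word-in-context (KWIC) sketch; stop words and window size
# are arbitrary choices for this illustration.

STOP_WORDS = {"the", "of", "and", "with", "a", "in"}

def kwic_index(text: str, window: int = 3) -> dict[str, list[str]]:
    """Index each significant word together with its surrounding context."""
    words = text.lower().split()
    index: dict[str, list[str]] = {}
    for i, w in enumerate(words):
        if w in STOP_WORDS:
            continue
        lo, hi = max(0, i - window), i + window + 1
        index.setdefault(w, []).append(" ".join(words[lo:hi]))
    return index

report = "biopsy of the left breast with evidence of carcinoma in the duct"
for context in kwic_index(report)["carcinoma"]:
    print(context)  # 'with evidence of carcinoma in the duct'
```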

Hersh (1998a) reviewed the evolution of natural language processing (NLP) for information retrieval systems, noted that they were among the earliest medical informatics applications, and defined information retrieval systems as systems to catalog and provide information about documents. Querying a medical database involved accessing, selecting, and retrieving the desired data; this was an essential function for a medical database. It usually required transforming the query so it could be executed by the computer, using special programs to retrieve the selected data; and this in turn required developing standards for the uniform collection, storage, and exchange of data. Blois (1982) emphasized that special programming languages were required to reach into a database and draw together desired subgroups of patients' data, and then to specify the desired operation to be performed on the data. Blois proposed that the detailed needs of such retrieval languages could be met either by using a form composed on the screen (query-by-form); or by a series of selections from a displayed "menu" of terms or phrases; or by the use of a natural-language, front-end computer program that converted a question expressed in English into a formal query language, which was then executed by the computer's database-system programs. Broering et al. (1989) noted that without computer help, users had to develop their own sets of rules to search, retrieve, and reconcile data from multiple databases; as the number of databases increased, it became much more difficult to manage all of the different rules between databases, so automated programs for querying data became a necessity. Hersh and Donohue (1998b) observed that in 1966, when the National Library of Medicine (NLM) launched its MEDLINE, it initially required specially trained users and a several-week turnaround time for a response to a mailed search statement. In 1997 NLM announced its Web-based MEDLINE and PubMed with easy-to-use interfaces (see Sect. 9.1.1).

The ability to query uncoded natural-language text was essential for the retrieval of many textual reports of clinical procedures and tests; of physicians' dictated surgery-operative reports, pathology reports, and x-ray and electrocardiogram interpretations; and of some clinical laboratory reports, such as for microbiology, that often required descriptive textual data rather than numeric data (Williams and Williams 1974; Lupovitch et al. 1979; Levy and Lawrance 1992). Eden (1960) noted that as medical databases increased in size, it took more time to conduct a search by the method of querying by key words; it was obvious that there was a need for computer programs that could efficiently conduct automatic query and search of textual data in databases. In 1959 one of the earliest programs for the search and retrieval of data for medical research was developed by J. Sweeney and associates at the University of Oklahoma; it was called the "General Information Processing System" (GIPSY). GIPSY was designed to permit the user, without any additional programming, to browse through the database, to pose complex queries against any of the stored data, and to obtain answers to ad-hoc inquiries from the assembled information. GIPSY was used at the University of Oklahoma as the primary support in projects concerning the analysis of patients' psychiatry records (Addison et al. 1969). Nunnery (1984) reported that in 1973 GIPSY was modified for use by health professionals and was then called the "Medical Information Storage System" (MISSY), which was used for some epidemiological studies. In 1982 a microcomputer-based system called "MICRO-MISSY", with more statistical procedures, was written in Microsoft BASIC and ran under the CP/M operating system. In the 1960s a relatively simple method for entering and retrieving uncoded textual data was to enter words, phrases, or sentences into a computer, and then retrieve such text by exact matching, letter-by-letter, word-by-word, or phrase-by-phrase. This method of natural language processing (NLP) was generally referred to as the "key-word-in-context" (KWIC) approach. In the 1960s an early way of applying this KWIC method was by using an IBM Magnetic Tape/Selectric Typewriter (MT/ST) that was interfaced to a magnetic-tape drive connected to a digital computer. Robinson (1970) used such a system to enter narrative surgical-pathology reports; at the time of transcription, the MT/ST system permitted the information to be entered into the computer by the typewriter, and the computer program then matched each word against a standard vocabulary, and also identified new or misspelled words for editing.

In the early 1960s G. Barnett and associates at the Massachusetts General Hospital (MGH) implemented their laboratory information system; and in 1971 they developed their Computer-Stored Ambulatory Record (COSTAR) system (see also Sect. 4.2). In 1979 they developed the Medical Query Language (MQL), used to query their databases, which were programmed in the MGH Utility Multiprogramming System (MUMPS) language. They structured the narrative textual data, such as that commonly found in physicians' progress notes, by using an interactive, conversational technique with a predetermined branching structure of the data and a fixed vocabulary. The user entered a query by selecting the desired items from a list on a display screen (Barnett and Hoffman 1968; Barnett et al. 1969). MQL was used for the retrieval and analysis of data from their COSTAR ambulatory patients' records. An MQL query was made up of a series of statements, each beginning with a keyword; each statement was scanned and passed on to a parser that matched the scanned symbols to rules in the MQL grammar, and the program then went on to execute the search. MQL permitted non-programmer users to submit complex, branching-logic queries that could be intricate and indefinitely long, or could be broken down into a series of sub-queries, each designed to accomplish some portion of the total problem. MQL had capabilities for cross-tabulation reports, scatter plots, online help, intermediate data storage, and system-maintenance utilities (Morgan et al. 1981; Shusman et al. 1983; Webster et al. 1987). Murphy et al. (1999) reviewed 16 years of COSTAR research queries that used MQL to search a large relational data warehouse, and reported that MQL was more flexible than SQL for searches of clinical data.
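The keyword-statement structure described for MQL might be sketched as follows; the keywords and the query below are hypothetical illustrations, not actual MQL syntax.

```python
# A sketch of splitting a query into keyword-led statements, as a scanner
# and parser might; keywords and query are invented examples.

KEYWORDS = {"SELECT", "IF", "LIST"}  # illustrative statement keywords

def split_statements(query: str) -> list[tuple[str, str]]:
    """Scan a query into (keyword, body) statements."""
    statements, current = [], None
    for token in query.split():
        if token.upper() in KEYWORDS:
            current = (token.upper(), [])
            statements.append(current)
        elif current is not None:
            current[1].append(token)
    return [(kw, " ".join(body)) for kw, body in statements]

query = "SELECT patients IF diagnosis = hypertension LIST name visit-date"
for keyword, body in split_statements(query):
    print(keyword, "->", body)
# SELECT -> patients
# IF -> diagnosis = hypertension
# LIST -> name visit-date
```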

Also in the early 1960s, H. Warner and associates at the University of Utah's LDS Hospital used the Health Evaluation through Logical Processing (HELP) database, which they had developed for patient care, also for clinical-decision support and for clinical research (see also Sect. 4.2). They stored the patient-care data in sectors organized in groups dealing with specific subsets of potential medical decisions; and they developed a query program to search and format the requested data. To permit a rapid, interactive response time, their query functions were run on a microcomputer that communicated with their central computer system. The HELP database was also used for alert reports from their laboratory, pharmacy, and radiology subsystems (Haug and Warner 1984). Ranum (1988) described their NLP approach to radiology reports, which were typically presented in a typewritten format. They had formerly created a list of common x-ray reports from which the radiologist selected and checked the one most appropriate for a patient's x-ray, with the option of entering a different report as text. They then developed a knowledge-based data-acquisition tool they called the Special Purpose Radiology Understanding System (SPRUS), which operated within their HELP system and contained knowledge bases for common conditions, beginning with frames of data for 29 pulmonary diseases. Haug et al. (1994) described their further development of NLP for chest x-ray reports with a new system they called the Symbolic Text Processor (SymText), which combined a syntactic parser with a semantic approach to concepts dealing with the various abnormalities seen in chest x-rays, including medical diseases, procedural tubes, and treatment appliances; SymText then generated output for the radiologists' reports to be stored in the patients' medical records. Warner et al. (1995) described their multi-facility system as one using a controlled vocabulary and allowing direct entry of structured textual data by clinicians (see also Sect. 4.3).

In 1962 Lamson and associates at the University of California, Los Angeles, were entering surgical pathology diagnoses in full English-language text into a computer-based, magnetic-file storage system (Lamson et al. 1965). The information was keypunched in English text in the exact form in which it had been dictated by the pathologists. A patient's record was retrieved by entering the patient's name or identification number, and a full prose printout of the pathologist's diagnosis was then provided. To avoid manual coding, Lamson collected 3 years of patients' data into a thesaurus that related all English words with identifiable relationships. His computer program matched the significant words present in a query, and then retrieved the patients' records that contained these words. In 1965 his patients' files contained about 16,000 words and his thesaurus contained 5,700 English words. His thesaurus contained hierarchical and synonymous relationships of terms, so that it could, for example, recognize that "dyspnea" and "shortness-of-breath" were acceptable synonyms (Jacobs 1967, 1968). It was recognized that more programming would be necessary to provide syntactic tests that could help to clear up problems of a syntactic nature; so Lamson, working with Jacobs and Dimsdale of IBM, went on to develop a natural-language retrieval system that contained a data dictionary for encoded reports from surgical pathology, bone-marrow examinations, autopsies, nuclear medicine, and neuroradiology, with unique numeric codes for each English word (Okubo et al. 1975). Patients' records were maintained in master text files, and new data were merged in the order of patients' medical-record numbers. A set of search programs produced a document that was a computer printout of the full English text of the initial record in an unaltered, unedited form. However, Lamson recognized that more programming was necessary to clear up both semantic and syntactic problems. In 1963 Korein and Tick at New York University Medical Center designed a method for storing physicians' dictated, uncoded narrative text in a variable-length, variable-field format. The narrative data were then subjected to a program that first generated an identifier and location for every paragraph in the record, and then reformatted the data on magnetic tape, with the data content of the document converted into a list of words and a set of desired synonyms. On interrogation the program would search for the desired words or synonyms, and then retrieve the selected text. This technique of identifying key words served as a common approach to retrieving literature documents (Korein et al. 1963, 1966; Korein 1970).
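Lamson's thesaurus-based matching might be sketched as follows, using the dyspnea example from the text; the thesaurus entries and records below are invented illustrations.

```python
# A sketch of thesaurus-based retrieval: a query word is expanded through
# synonym relationships before matching records. Data here are invented.

THESAURUS = {
    "dyspnea": {"dyspnea", "shortness-of-breath"},
    "shortness-of-breath": {"dyspnea", "shortness-of-breath"},
}

records = {
    1001: "patient reports shortness-of-breath on exertion",
    1002: "no respiratory complaints",
}

def search(query_word: str) -> list[int]:
    """Return record numbers containing the query word or any synonym."""
    terms = THESAURUS.get(query_word.lower(), {query_word.lower()})
    return [rid for rid, text in records.items()
            if any(t in text.lower() for t in terms)]

print(search("dyspnea"))  # [1001] -- matched via the synonym relationship
```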

Buck (1966), in D. Lindberg’s group at the University of Missouri at Columbia, described their program for retrieving patients’ records from computer files that included the coded patients’ discharge diagnoses, surgery reports, surgical pathology and cytology reports, and the interpretations of electrocardiograms and x-rays. The diagnoses files were stored on magnetic tape in a fixed-field format, and processed by an IBM 1410 computer system. Queries were entered from punched cards containing the code numbers of the diagnoses to be retrieved. The computer searched the magnetic-tape files, which in 1966 contained more than 500,000 patients’ records, for the diagnoses, and then identified the medical-record numbers of the patients’ records that contained the desired diagnoses. Lindberg (1968a, b) also developed a computer program called CONSIDER, which allowed a query from a remote computer terminal to search, match, and retrieve material from the Current Medical Terminology knowledge database that contained definitions of more than 3,000 diseases. The CONSIDER program was interactive in that it allowed the user to retrieve lists of diseases, matched by Boolean combinations of terms, and sorted in a variety of ways, such as alphabetically or by frequency. The CONSIDER program accepted a set of signs, symptoms, or other medical findings, and then responded by arraying a list of names of diseases that involved the set of medical findings that had been specified. Blois (1981) and associates at the University of California-San Francisco expanded the program into one called RECONSIDER, which was able to match diseases by parts of disease names, or by phrases within definitions. Using a DEC 11/70 minicomputer with the VAX UNIX operating system, they were able to search inverted files of the encoded text of Current Medical Information and Terminology (CMIT), 4th edition, as the knowledge base. They concluded that RECONSIDER could be useful as a means of testing other diagnostic programs (Blois et al. 1981, 1982). Nelson (1983) and associates at New York State University at Stony Brook tested various query strategies using the RECONSIDER program, and reported that they were unable to determine a strategy they considered optimal. Anderson et al. (1997) further modified the RECONSIDER program to use it for differential diagnoses; and added a time-series analysis program, an electrocardiogram-signal analysis program, an x-ray-images database, and a digital-image analysis program.

In the 1960s commercial search and query programs for large databases became available, led by Online Analytic Processing (OLAP), which was designed to aid in providing answers to analytic queries that were multi-dimensional and used relational databases (Codd et al. 1993). Database structures were considered to be multidimensional when they contained multiple attributes, such as time periods, locations, product codes, diagnoses, treatments, and other items that could be defined in advance and aggregated in hierarchies. The combination of all possible aggregations of the base data was expected to contain answers to every query which could be answered from the data. In the early 1970s the Structured Query Language (SQL) was developed at IBM by Chamberlin and Boyce (1974) as a language designed for the query, retrieval, and management of data in a relational database-management system, such as had been introduced by Codd (1970). However, Nigrin and Kohane (1999) noted that, in general, clinicians and administrators who were not programmers could not themselves generate novel queries using OLAP or SQL. Furthermore, Connolly and Begg (1999) advised that querying a relational database with SQL required developing algorithms that optimized the computer-processing time needed when a high-level query with multiple entities, attributes, and relations involved many transformations. T. Connolly also described a way of visualizing a multi-dimensional database by beginning with a flat file of a two-dimensional table of data; then adding another dimension to form a three-dimensional cube of data called a hypercube; and then adding cubes of data within cubes of data, with each side of each cube called a dimension, the result representing a multi-dimensional database. Pendse (2008) described in some detail the history of OLAP, and credited the publication in 1962 by K. Iverson of A Programming Language (APL) as the first mathematically defined, multidimensional language for processing multidimensional variables. Multidimensional analyses then became the basis for several versions of OLAP developed by International Business Machines (IBM) and others in the 1970s and 1980s; and in 1999 this approach appeared as the Analyst module in Cognos, which was subsequently acquired by IBM. By the year 2000 several new OLAP derivatives were in use by IBM, Microsoft, Oracle, and others (see also Sect. 2.2).
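The aggregation idea behind such multidimensional structures can be sketched with a small relational example (Python with SQLite; the table, columns, and values are invented, and modern SQL stands in for the historical OLAP products): a GROUP BY over two dimensions computes one cell of the hypercube per year-and-diagnosis combination.

```python
# Hypothetical sketch: aggregating a base table along two dimensions
# (time period and diagnosis), as an OLAP-style query would.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (year INTEGER, diagnosis TEXT, charge REAL)")
con.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1993, "asthma", 120.0), (1993, "asthma", 80.0), (1994, "diabetes", 200.0)],
)
# Each output row is one cell of the (year x diagnosis) aggregation.
for row in con.execute(
    "SELECT year, diagnosis, COUNT(*) AS n, SUM(charge) AS total "
    "FROM visits GROUP BY year, diagnosis"
):
    print(row)   # (1993, 'asthma', 2, 200.0) then (1994, 'diabetes', 1, 200.0)
```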

In 1970 C. McDonald and associates at the Regenstrief Institute for Health Care and the Indiana University School of Medicine began to develop a clinical database for their Regenstrief Medical Record System (RMRS) (see also Sect. 4.3). Much of the clinical data was filed in a manually coded format that could be referenced to the system’s data dictionary, which permitted each clinical subsystem to specify and define its data items. Data were entered by code, or by text that had been converted to code. The RMRS had a special retrieval program called CARE, which permitted non-programmers to perform complex queries of the medical-record files. CARE programs also provided quality-of-care reminders, alert messages, and recommended evidence-based practice guidelines (McDonald 1976, 1982). Myers (1970) and associates at the University of Pennsylvania reported a system in which a pathology report was translated into a series of keywords or data elements that were encoded using arbitrarily assigned numbers. While the typist entered the text of the pathology report using a typewriter controlled by a paper-tape program, the data elements were automatically coded, and a punched paper tape was produced as a by-product of the typing. The report was then stored on either magnetic tape or a disk storage system. Karpinski (1971) and associates at the Beth Israel Hospital in Boston described their Miniature Information Storage and Retrieval (MISAR) system, written in the MUMPS language for their PDP-15 computer, and designed to maintain and search small collections of data on relatively inexpensive computers. MISAR was planned to deal with summaries of medical records in order to abstract from them correlations of clinical data. It was a flexible, easy-to-use, online system that permitted rapid manipulation of data without the need for any additional computer programming. A principal advantage of MISAR was the ease with which a small database could be created, edited, and queried at relatively low cost. In 1972 Melski, also at the Beth Israel Hospital in Boston, used MISAR for eight registries, each consisting of a single file divided into patients’ records; each record was divided into fields that could take on one or more values. MISAR stored its patients’ records in upright files that were arranged in the order in which the data items were collected; the data were also reorganized in inverted files by data item, for example by the laboratory chemistry sodium test, in order to rapidly perform searches and manipulate simple variables. Soon the system was expanded to MISAR II, with increased speed, the ability to serve up to 22 user terminals simultaneously, and the capacity to accommodate interactive analyses of multi-center studies and large clinical trials. They were impressed with this improved capability of using a convenient terminal to rapidly perform complex searches and analyses of data from a computer database (Melski et al. 1978).
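The contrast between upright and inverted files described for MISAR can be suggested by a minimal sketch (in Python, with invented patient data): the inverted index, keyed by data item, lets a search for an elevated sodium avoid scanning every patient record.

```python
# Hypothetical sketch of the upright-file vs. inverted-file idea:
# records are stored per patient, and an inverted index keyed by data
# item goes straight to the matching records.
from collections import defaultdict

upright_file = {
    "MRN-001": {"sodium": 151, "potassium": 4.1},
    "MRN-002": {"sodium": 139},
}

inverted_file = defaultdict(list)            # data item -> [(mrn, value)]
for mrn, items in upright_file.items():
    for item, value in items.items():
        inverted_file[item].append((mrn, value))

# All patients with an elevated sodium, via one look-up of "sodium".
print([mrn for mrn, v in inverted_file["sodium"] if v > 145])  # ['MRN-001']
```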

In 1973 Weyl (1975) and associates at Stanford University Medical Center developed their Time Oriented Databank (TOD) system, which was designed as a table-driven computer system to record and analyze medical records. The TOD system consisted of more than 60 programs, which supported data entry and data update, file definition and maintenance, and data analysis functions. The TOD system was used on a mainframe computer for the ARAMIS database (see also Sect. 5.3). In 1982 the TOD system was converted to a microcomputer-based version called MEDLOG (Layard et al. 1983). Enlander (1975) described a computer program that searched for certain pre-established key words in each diagnosis sentence according to a hierarchical structure based on the four-digit SNOP codes. As a test, when this method was applied to 500 diagnostic sentences, the automated key-word search encoded about 75% of them. In the clinical information system at Kaiser Permanente in Oakland, CA, Enlander used a visual-display terminal equipped with a light-pen pointer to select and enter a diagnosis, and the SNOP-coded diagnosis was then automatically displayed.

In 1976 a group at the Harvard School of Public Health developed a generalized database-management system called MEDUS/A for the kinds of data generated in the clinical-care process; it was also used for clinical research. Its principal mode of data acquisition and display was through user-written, interactive questionnaires and reports (Miller and Strong 1978). In 1977 MEDUS/A was used at the Harvard School of Public Health for a study that used data from patients with diabetes mellitus, and for another study that used data from patients with coronary artery disease. King et al. (1983a, 1988) reported that MEDUS/A enabled nonprogrammers to use their databases and customize their data entry, support their data queries, generate reports, and provide statistical analyses. A second version of MEDUS/A was written in the Standard MUMPS language (Goldstein 1980); and in 1983 a statistical package called GENESIS was added.

In 1976 a clinical information system called CLINFO was sponsored by the Division of Research Resources of the National Institutes of Health (NIH) for data entry, query, retrieval, and analysis. It was developed by a consortium of computer scientists at the Rand Corporation and clinical investigators at Baylor College of Medicine, the University of Washington, the University of Oklahoma, and Vanderbilt University. Lincoln et al. (1976) at the Rand Corporation and the University of Southern California described the early CLINFO system that was used for a test group of leukemia patients. In a single, small, interactive, user-oriented system, it provided the integration of the schema, the study data file, the components designed for the entry and retrieval of time-oriented data, and a statistical analysis package. These units had been programmed separately, but their usefulness was increased by their integration. The Vanderbilt group that participated in the development of CLINFO reported on their first 5 years of experience with its use by more than 100 clinical investigators. They found that the positive and successful experience with the CLINFO system was due to its set of functions directed towards data management and data analysis; that it was a friendly, easy-to-use computer tool; and that it eliminated for its users the operational problems that had often been associated with shared central-computer resources (Mabry et al. 1977; Johnston et al. 1982a, b). The CLINFO consortium reported a series of CLINFO-PLUS enhancements written in the C language; the system then consisted of about 100 systematically designed and closely integrated programs, by means of which a clinical investigator could specify for the computer the types of data being studied, and then enter and retrieve the data in a variety of ways for display and analysis. The investigators communicated with the system by means of simple English-language word commands, supported by a number of computer-generated prompts. The system was designed for a clinical investigator with no expertise in computing, and the investigator was not required to acquire any knowledge of computing in order to use the system (Whitehead and Streeter 1984; Thompson et al. 1977). By the end of the 1980s CLINFO was widely used for clinical research in the United States. In 1988 the NIH Division of Research Resources (DRR) listed 47 of its 78 General Clinical Research Centers as using CLINFO for multidisciplinary and multicategorical research (NIH-DRR 1988). Some of these research centers also used a program similar to CLINFO called “PROPHET”, which was developed in the early 1980s by Bolt, Beranek and Newman in Cambridge, MA, and allowed the use of interactive, three-dimensional graphics designed more for biomedical scientists than for clinical investigators. McCormick (1977) and associates in the Medical Information Systems Laboratory at the University of Illinois in Chicago described their design of a relational-structured clinical database to store and retrieve textual data, and also pictorial information such as computer-tomography, automated-cytology, and other digitized images. Their Image Memory was incorporated into an integrated database system using a PDP 11/40 minicomputer. They predicted that an image database would become a normal component of every comprehensive medical database-management system that included digital-imaging technology.

With the increasing need to efficiently query larger and multiple databases, it became evident that more efficient programs were needed for querying uncoded textual data. The need was to replace the usual key-word-in-context (KWIC) approach, in which the user queried uncoded textual data by selecting what were judged to be relevant key words or phrases for the subject of the query, and the program then searched for, matched, and retrieved these key words or phrases in the context in which they were found in a reference knowledge source. One approach was to expand the number of key words used to query the knowledge source, in the hope that the additional terms in a phrase or a sentence would supply some semantic meaning, since most English words have several meanings; this might improve the recognition and matching of the users’ information needs and lead to better retrieval performance. In addition to query programs that permitted investigators to search and retrieve uncoded textual data from clinical databases by entering user-selected key words or phrases, more sophisticated programs began to be developed to assist the investigator in studying medical hypotheses. More advanced NLP systems added knowledge bases to guide the user by displaying queries and their responses, and employed rules and decision trees that led to the best matching code. Although the search for matching words in a knowledge base made retrieval easier, it was still difficult to search for and retrieve exact, meaningful expressions from text: while it was easy to enter, store, and match words, it was not always easy for the retriever to figure out what the words had meant to the one who had originally entered them into the knowledge base. Blois (1984) explained the problem by saying that computers were built to process the symbols fed to them in a manner prescribed by their programs, where the meaning of the symbols was known only to the programmers, rarely to the program, and never to the computer; consequently one could transfer everything in the data except its meaning. Blois further pointed out that the available codes rarely matched the clinical data precisely, and the user often had to force the data into categories that might not be the most appropriate. Some advanced automated NLP programs used machine-learning methods with algorithms that applied relatively simple “if-then” rules to automatically “learn” from a “training” knowledge base consisting of a large set of sentences in which the correct part of speech was attached to each word; rules were then generated for determining the part of speech of a word in a query, based on the nature of the word itself, the nature of the adjacent words, and the most likely parts of speech for the adjacent words. Some used more complex statistical methods that applied weights to each input item, made probabilistic decisions, and expressed the relative certainty of several possible answers rather than of only one. Machine-learning programs then needed to be tested for their accuracy by applying them to new sentences.
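A minimal sketch of this rule-based tagging idea (in Python, with an invented tag set and invented training data) assigns each word its most frequent part of speech from a tagged training set and then applies a simple if-then contextual rule to correct the initial guesses.

```python
# Hypothetical sketch of learning simple tagging rules from a tagged
# "training" set: take each word's most frequent part of speech, then
# correct guesses with an if-then rule that looks at the adjacent tag.
from collections import Counter, defaultdict

training = [("the", "DET"), ("chest", "NOUN"), ("film", "NOUN"),
            ("shows", "VERB"), ("no", "DET"), ("shift", "NOUN")]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1
most_likely = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_sentence(words):
    tags = [most_likely.get(w, "NOUN") for w in words]   # default guess
    for i in range(1, len(words)):
        # if-then contextual rule: a word right after a determiner
        # is taken to be a noun, not a verb
        if tags[i - 1] == "DET" and tags[i] == "VERB":
            tags[i] = "NOUN"
    return list(zip(words, tags))

print(tag_sentence(["the", "film", "shows", "no", "shift"]))
```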

In the late 1970s Sager (1978, 1980, 1982a, b, 1983) and associates at New York University made substantial contributions to natural-language processing (NLP) when they initiated their Linguistic String Project (LSP), which extracted and converted the natural-language, free-text, uncoded narrative from patients’ medical records into a structured database; they also addressed the problem of developing a query program for retrieval requests sent to the database. Story and Hirschman (1982) described the LSP’s early approach to NLP as first recognizing the time-dated information found in the text of patients’ hospital discharge summaries, such as the dates and times of clinical events, and then computing from that information the ordering of the times of the recorded medical events. As examples, the data used in patients’ discharge summaries included birth dates, admission and discharge dates, and the dates and times of any recorded patients’ symptoms, signs, and other important clinical events. Sager et al. (1982a, b) further described their LSP process for converting the uncoded natural-language text found in patients’ hospital discharge summaries into a structured relational database. In a relational database the query process had to search several tables in order to complete the full retrieval; for a query such as “Find all patients with a positive chest x-ray”, the program executed a query on one table to find the patients’ identification numbers, and then a sub-query on another table to find those patients reported to have positive chest x-ray reports. Whereas earlier attempts at automated encoding systems for text dealt with phrases that were matched with terms in a dictionary, this group first performed a syntactic analysis of the input data, and then mapped the analyzed sentences into a tabular arrangement of syntactic segments, in which the segments were labeled according to their medical information content. Using a relational structured database, the rows in their information-format table corresponded to the successive statements in the documents, and the columns corresponded to the different types of information in the statements. Thus their LSP automatic-language processor parsed each sentence and broke it into syntactic components such as subject-verb-object; then divided the narrative segments into six statement types: general medical management, treatment, medication, test and result, patient state, and patient behavior; and then transformed the statements into a structured tabular format. This transformation of the record was suitable for their database-management system, and it simplified the retrieval of a textual record, which when queried was transformed back to the users in narrative form. Sager (1983, 1994) described in some detail their later approach to converting uncoded free-text patient data by relationships of medical-fact types or classes (such as body parts, tests, treatments, and others), and by subtypes or sub-classes (such as arm, blood glucose, medications, and others). Their Linguistic String Project information-formatting program identified and organized the free text by syntactic analysis using standard methods of sentence decomposition, and then mapped the free text into a linguistically structured knowledge base for querying. The results of tests for information precision and information recall of their LSP system were better than 92% when compared to manual processing.
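The two-step retrieval described for such a query can be sketched in modern relational terms (Python with SQLite; the tables and values are invented, not the LSP’s actual schema): a sub-query on a reports table restricts a query on a patients table to those with a positive chest x-ray.

```python
# Hypothetical relational sketch of the query "Find all patients with
# a positive chest x-ray": one table identifies patients, another
# holds report statements, and a sub-query links the two.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (pid INTEGER, name TEXT)")
con.execute("CREATE TABLE reports (pid INTEGER, test TEXT, result TEXT)")
con.executemany("INSERT INTO patients VALUES (?, ?)", [(1, "Doe"), (2, "Roe")])
con.executemany("INSERT INTO reports VALUES (?, ?, ?)",
                [(1, "chest x-ray", "positive"), (2, "chest x-ray", "negative")])

rows = con.execute(
    "SELECT name FROM patients WHERE pid IN "
    "(SELECT pid FROM reports WHERE test = 'chest x-ray' "
    " AND result = 'positive')"
).fetchall()
print(rows)  # -> [('Doe',)]
```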
In 1985 they reported that their medical-English lexicon, which gave for each word its English and medical classification, numbered about 8,000 words (Lyman et al. 1985). Sager et al. (1986) reported that they had applied their methods of linguistic analysis to a considerable body of clinical narrative that included patients’ initial histories, clinic visit reports, radiology and pathology reports, and hospital discharge summaries. They successfully tested their approach for the automatic encoding of narrative text in the Head-and-Neck Cancer Database maintained at that time at the Roswell Park Memorial Institute. Sager et al. (1994) reported that their Linguistic String Project (LSP) had been applied to a test set of asthma patients’ health-care documents; when these were subjected to a SQL retrieval program, the retrieval results averaged only 1.4% for major errors and 7.5% for major omissions. Sager et al. (1996) further reported using Web processing software to retrieve medical documents from the Web; and by using software based on the Standard Generalized Markup Language (SGML) and the Hypertext Markup Language (HTML), they coupled text markup with highlighted displays of retrieved medical documents.

Doszkocs (1983) and associates at the National Library of Medicine noted that rapid advances had occurred in automated information-retrieval systems for science and technology. In 1980 more than 1,000 databases were available for computerized searching, and more than two million searches were made in these databases. In the 1980s a variety of other approaches were developed for searching and querying clinical-research databases that were linked to patient-care databases. Kingsland’s (1982) Research Database System (RDBS) used microcomputers for storing and searching a relatively large number of observations in a relatively small number of patients’ records. Shapiro (1982) at the Medical University of South Carolina developed a System for Conceptual Analysis of Medical Practices (SCAMP) that was able to respond to a query expressed in natural language. Words in free text, rather than codes, were used, as in “Which patients had a prolapsed mitral valve?” The program parsed the request that was expressed in English; it looked up relevant matching words in a thesaurus, and passed linguistic and procedural information found in the thesaurus to a general-purpose retrieval routine that identified the relevant patients based on the free-text descriptions. Miller et al. (1983) described System 1022, which could access and query relational databases. Dozier et al. (1985) used a commercial Statistical Analysis System (SAS) database. Katz (1986) reported developing the Clinical Research System (CRS), a specialized database-management system intended for storing and managing patient data collected for clinical trials, and designed for direct use by physicians.

Porter (1984), Safran (1989a, b, c) and associates at Boston’s Beth Israel Hospital, the Brigham and Women’s Hospital, and the Harvard Medical School in 1984 expanded the PaperChase program (see also Sect. 9.2) into a program called ClinQuery, which was designed to allow physicians to perform searches in a large clinical database. ClinQuery was written in a dialect of MUMPS, and was used to search their ClinQuery database, which contained selected patient data that was de-identified to protect patients’ privacy; the data was transferred automatically every night from their hospitals’ clinical-information systems. Adams (1986) compared three query languages commonly used in the 1980s for medical-database systems: (1) the Medical Query Language (MQL), which was developed by O. Barnett’s group with the objective of query and report generation for patients using the Computer-Stored Ambulatory Record (COSTAR); MQL was portable to any database using the MUMPS language, and at that date COSTAR was used in more than 100 sites worldwide, with some carrying 200,000 patient records on-line. (2) The CARE system, which was developed by C. McDonald’s group with a focus on surveillance of the quality of ambulatory patient care; it contained more than 80,000 patients’ records, and was programmed in VAX BASIC running on a DEC VAX computer. (3) The HELP (Health Evaluation through Logical Processing) system, which was developed by H. Warner’s group with a focus on surveillance of hospital patient care, and was implemented on a Tandem system operating in the Latter Day Saints (LDS) hospitals in Utah. Adams reported that the three programs had some common properties, yet used different designs that focused on the specific objectives for which each was developed; Adams concluded that each was successful and well used.

Broering (1987, 1989) and associates at Georgetown Medical Center described their BioSYNTHESIS system, which was developed as a National Library of Medicine (NLM) Integrated Academic Information Management System (IAIMS) research project. The objective of the project was to develop a front-end software system that could retrieve information stored in disparate databases and computer systems. In 1987 they developed BioSYNTHESIS/I as a gateway system with a single entry point into the IAIMS databases, to make it easier for users to access selected multiple databases. BioSYNTHESIS/II was developed to function as an information finder that was capable of responding to a user’s queries for specific information, and of searching composite knowledge systems containing disparate components of information. The system therefore had to be capable of functioning independently with the various knowledge bases that required different methods to access and search them. Hammond et al. (1989) reported that a program called QUERY was written to permit users to access any data stored in Duke’s The Medical Record (TMR) database. The program could access each patient’s record in the entire database or in a specified list of records, and carry out the query. The time for a typical query run, depending on the complexity of the query, was reported to be 4–6 h on a database containing 50,000–100,000 patients. Prather et al. (1995) reported that by 1990 the Duke group had converted their legacy databases into relational-structured databases, so that personal computers using the SQL language could more readily query all of the patients’ records in the TMR clinical databases, which by 1995 had accumulated 25 years of patients’ data. Frisse (1989), Cousins (1990) and associates at Washington University School of Medicine described a program they developed to enhance their ability to query textual data in large, medical, hypertext systems. As the amount of text in a database increased, they considered it likely that the proportion of text relevant to a query would decrease. To improve the likelihood of finding relevant responses to a query, they defined a query-network that consisted of a set of nodes represented by weighted search terms considered to be relevant to the query. They assigned a weight to each search term in the query-network based on their estimate of the conditional probability that the search term was relevant to the primary index subject of the query; and the search term’s weight could be further modified by user feedback to improve the likelihood of its relevance. Searches were then initiated based on the relative search-term weights; and they concluded that their approach could aid in information retrieval and also assist in the discovery of related new information.
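A minimal sketch of such a weighted query-network (in Python, with invented terms, weights, and documents) ranks documents by the summed weights of the query terms they contain, with user feedback adjusting a term’s weight afterwards.

```python
# Hypothetical sketch of a weighted query-network: each search term
# carries a weight estimating its relevance to the query subject;
# documents are ranked by the summed weights of matching terms.
query_terms = {"hypertext": 0.9, "retrieval": 0.6, "medicine": 0.3}

documents = {
    "doc-A": "hypertext systems for information retrieval",
    "doc-B": "a short history of medicine",
}

def score(text):
    words = set(text.lower().split())
    return sum(w for term, w in query_terms.items() if term in words)

for doc in sorted(documents, key=lambda d: score(documents[d]), reverse=True):
    print(doc, score(documents[doc]))   # doc-A 1.5, then doc-B 0.3

# User feedback might then lower a term judged less relevant:
query_terms["medicine"] *= 0.5
```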

Frisse (1996) emphasized that information relevant to a task must be separated from information that is not considered relevant, and defined the relevance of a retrieved set of documents in terms of recall and precision. Frisse defined recall as the percentage of all relevant items in a collection that were retrieved in response to a query, and precision as the percentage of items retrieved that were relevant to the query. He defined sensitivity as the percentage of true positives that were identified, and specificity as the percentage of true negatives that were identified. He also noted that if a search were widened by adding an additional search term to the query statement with the word “or”, then one was more likely to retrieve additional items of interest, but also more likely to retrieve items not relevant to the specific query; and if one increased the number of constraints in a query by using the word “and”, then one would retrieve fewer items, but the items retrieved were more likely to be relevant to the expanded query. Levy and Rogers (1995) described an approach to natural language processing (NLP) that was used at that time in the Veterans Administration (VA). A commercial Natural Language Incorporated (NLI) software package was the NLP interface that allowed English queries to be made of the VA database. Software links between the NLP program and the VA database defined relationships, entities, attributes, and their interrelationships; and queries about these concepts were readily answered. When a user typed in a question, the NLP processor interpreted the question, translated it into an SQL query, and then responded. If the query was not understood by the NLP system, it guided the user and assisted in generating a query that could be answered.
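These measures are simple proportions, sketched below in Python with invented sets; widening a query with “or” tends to raise recall at the cost of precision, while adding “and” constraints does the reverse.

```python
# Sketch of the retrieval measures as defined above, computed from a
# set of retrieved items and the set of truly relevant items.
def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}    # items returned by a query
relevant = {"d1", "d2", "d5"}           # items actually relevant

print(recall(retrieved, relevant))      # 2/3 of relevant items retrieved
print(precision(retrieved, relevant))   # 2/4 of retrieved items relevant
```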

Friedman et al. (1992, 1998a, b) reviewed and classified some of the approaches to NLP developed in the 1980s. They classified NLP systems according to their linguistic knowledge as: (1) Pattern-matching or keyword-based systems, which were variations of the keyword-in-context approach in which the text was scanned by the computer for combinations of medical words and phrases, such as medical diagnoses or procedures, and algorithms were used to match those in a terminology or vocabulary index; once identified, the words and phrases would be translated automatically into standard codes. These were relatively simple to implement but relied only on patterns of key words, so relationships between words in a sentence could not readily be established. This approach was useful in medical specialties that used relatively highly structured text and clinical sub-languages, such as pathology and radiology (a minimal sketch of this class follows after this list). (2) Script-based systems, which combined keywords and scripts of a description or of a knowledge representation of an event that might occur in a clinical situation. (3) Syntactic systems, which parsed each sentence in the text, identified which words were nouns, verbs, and other parts of speech, and noted their locations in the sequence of words in the sentence. These were considered to be minimal semantic systems, in which some knowledge of language was used, such as syntactic parts of speech, so simple relationships within a noun phrase might be established, but relationships between different noun phrases could not be determined; such a system required a lexicon that contained syntactic word categories and a method that recognized noun phrases. (4) Semantic systems, which added definitions, synonyms, meanings of terms and phrases, and concepts; their semantic grammars could combine frames to provide more domain-specific information. Semantic systems used knowledge about the semantic properties of words, and relied on rules that mapped words with specific semantic properties into a semantic model that had some knowledge of the domain and could establish relationships among words based on their semantic properties; they could be appropriate for highly structured text that contained simple sentences. (5) Syntactic and semantic systems, which included stages of both of these processes, and used both semantic and syntactic information and rules to establish relationships among words in a document based on their semantic and syntactic properties. (6) Syntactic, semantic, and knowledge-based systems, which included reference, conceptual, and domain information, and might also use domain knowledge bases. These were the most complex NLP systems to implement, and were used in the most advanced NLP systems that evolved in the 1990s and 2000s.
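A minimal sketch of class (1), the keyword-based encoder (in Python, with an invented phrase-to-code table), scans the text for known phrases and emits standard codes without establishing any relationships among the words.

```python
# Hypothetical sketch of a keyword-based encoder: scan the text for
# known words and phrases and emit standard codes; no relationships
# between the words are established. The phrase table is invented.
CODE_TABLE = {
    "myocardial infarction": "410",
    "pneumonia": "486",
}

def encode(text):
    text = text.lower()
    return [code for phrase, code in CODE_TABLE.items() if phrase in text]

print(encode("Findings are consistent with pneumonia."))  # -> ['486']
```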

Das and Musen (1995) at Stanford University compared three data-manipulation methods for temporal querying: (1) the consensus query representation, Arden Syntax; (2) the commercial standard query language, SQL; and (3) the temporal query language, TimeLineSQL (TLSQL). They concluded that TLSQL was the query method most expressive for temporal data; and they built a system called “Synchronus” that had the ability to query their legacy SQL databases that supported various data time-stamping methods. O’Connor et al. (2000) also noted that querying clinical databases often posed temporal problems when clinical data were not time-stamped, such as when a series of laboratory test reports did not provide the time intervals between the tests. They developed a temporal query system called Tzolkin that provided a temporal query language and a temporal abstraction system that helped when dealing with temporal indeterminacy and temporal abstraction of data. Schoch and Sewell (1995) compared four commercial NLP systems that were reported to be used for searching natural-language text in MEDLINE: (1) FreeStyle (FS) from Lexis-Nexis; (2) Physicians Online (POL); (3) Target on Dialog (TA) from Knight-Ridder; and (4) Knowledge Finder (KF), available from Aries only on CD-ROM. On a single day in 1995, 36 topics were searched, using similar terms, directly on NLM’s MEDLINE; and the first 25 ranked references from each search were selected for analysis. They found that all four systems agreed on the best references for only one topic. Three systems, FS, KF, and TA, chose the same first reference; POL ranked it second. The four searches found 12 unique references with all concepts matching. The evaluation of NLP systems was often based on comparing their individual outputs for completeness of recall and for accuracy in matching specified criteria, and sometimes against the “gold standard” of manual output by clinical experts; however, given a set of criteria, human evaluation was often found to be more variable in its results than computer evaluation.
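The time-stamping point can be made concrete with a small sketch (Python, with invented dates): intervals between successive tests are computable only when each result carries a time stamp.

```python
# Hypothetical sketch: with time-stamped results, the interval between
# successive laboratory tests can be computed; without time stamps the
# question cannot be answered at all.
from datetime import date

glucose_tests = [date(1995, 3, 1), date(1995, 3, 15), date(1995, 4, 2)]

intervals = [(b - a).days for a, b in zip(glucose_tests, glucose_tests[1:])]
print(intervals)   # -> [14, 18] days between consecutive tests
```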

Conceptual approaches to querying large, complex medical databases were developed in the 1990s; they were based on combining the characteristics of the query subject and creating a conceptual model for the search, rather than just using key words and phrases; ontologies of the concepts and relationships of medical knowledge then began to be developed. Chute (1995) and associates at the Mayo Clinic in Rochester, Minnesota, reported updating their legacy collection of 4.6 million paper-based patient-record Master Sheets that dated back to 1909; and with the addition of their newer electronic clinical database, their researchers were confronted with more than 200 specialized clinical databases that resided on various hardware and used various software. They needed to interface these disparate databases, on a spectrum of platforms, to many types of workstations using a variety of browsers. To meet these problems and facilitate the retrieval of their stored medical information, they introduced Web protocols, graphical browsers, and several versions of the Hypertext Markup Language (HTML) to link to their computer server. They also used the high-level language Perl, which supported SQL interfaces to a number of relational-structured databases, and used Perl interfaces for dynamically generated HTML screens. They also observed the legal need for maintaining the security and confidentiality of patient data when using the Web.

Hersh (1990a, b, 1995a, 1996a, 1998a, b) and associates at Oregon Health Sciences University outlined their requirements for clinical vocabularies intended to facilitate their use with natural language processing (NLP) systems for electronic medical records. The requirements included: (1) lexical decomposition, to allow the meaning of individual words to be recognized in the context of the entire sentence; (2) semantic typing, to allow for the identification of synonyms and their translation across semantic-equivalence classes; and (3) compositional extensibility, to allow words to be combined to generate new concepts. They addressed the problem of accessing documents with desired clinical information when using the Web with its highly distributed information sources; and they reported developing an information-retrieval system called SAPHIRE (Semantic and Probabilistic Heuristic Information Retrieval Environment). SAPHIRE was derived from NLM’s UMLS Metathesaurus, which had been created by the NLM to allow translation between terms within different medical vocabularies. SAPHIRE provided a concept-matching algorithm that processed strings of free text to find concepts, and then mapped the concepts into a semantic-network structure for the purposes of providing both automated indexing and probabilistic retrieval by matching the diverse expressions of concepts present in both the reference documents and the users’ queries. For the purpose of indexing, each textual document was processed one sentence at a time, and its concepts were weighted for terms occurring frequently, thereby designating a term’s value as an indexing concept. In retrieval, the user’s query was processed to obtain its concepts, which were then matched against the indexing concepts in the reference documents in order to obtain a weighted list of matching documents. To formulate a search with SAPHIRE, the user entered a free-text query and received back a list of concepts, from which the user could delete concepts or to which the user could add them; the search was then initiated. A score was calculated by summing the weights of all the concepts, and the matching documents with the highest scores were ranked first for retrieval. Hersh (1995a, b) reported a series of modifications to their concept-matching indexing algorithm to improve the sensitivity and specificity of its automated retrievals. He also completed some evaluations of the recall and precision of automated information-retrieval systems compared to traditional key-word retrieval using text words, and suggested that it was uncertain whether one indexing or retrieval method was superior to another. Spackman and Hersh (1996) evaluated the ability of SAPHIRE to do automatic searches for noun phrases in medical-record discharge summaries by matching terms from SNOMED, and reported matches for 57% of the phrases. They also reported evaluating the ability of two NLP parsers, called CLARIT and the Xerox Tagger, to identify simple noun phrases in medical discharge summaries, and reported exact matches for 77% and 69% of the phrases, respectively.
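A minimal sketch of this style of concept matching and frequency-weighted retrieval (in Python; the string-to-concept table and the concept identifiers are invented, not SAPHIRE’s actual resources) maps free text to concepts, weights the concepts by frequency, and ranks documents by the summed weights of the concepts they share with the query.

```python
# Hypothetical sketch of concept matching with frequency weighting:
# surface strings map to concept identifiers; counts act as index
# weights; documents are ranked by matching concept weights.
from collections import Counter

CONCEPTS = {                       # surface string -> concept identifier
    "heart attack": "C-MI",
    "myocardial infarction": "C-MI",
    "chest pain": "C-CP",
}

def extract(text):
    """Map free text to a bag of concepts; counts act as index weights."""
    text = text.lower()
    found = Counter()
    for phrase, cid in CONCEPTS.items():
        found[cid] += text.count(phrase)
    return +found                  # drop zero-count entries

index = {
    "doc-1": extract("Myocardial infarction; evolving myocardial infarction."),
    "doc-2": extract("Chest pain, no evidence of myocardial infarction."),
}

def search(query):
    q = extract(query)
    scores = {d: sum(w[c] for c in q if c in w) for d, w in index.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search("patient with a prior heart attack"))  # doc-1 ranks first
```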

Hersh et al. (1996b) also reported developing CliniWeb, a searchable database of clinical information on the Web, which provided: (1) a database of clinically oriented Universal Resource Locators (URLs); (2) an index of URLs with terms from the NLM’s MeSH vocabulary; and (3) an interface for accessing URLs by browsing and searching. He described problems due to Web databases being highly distributed and lacking an overall index for all of their information. CliniWeb served as a test-bed for research into defining the optimal method to build and evaluate a clinically oriented Web resource. The user could browse the MeSH hierarchy or search for MeSH terms using free-text queries, and then rapidly access the URLs associated with those terms. Hersh and Donohue (1998b) also noted that SAPHIRE could query a database in seven languages other than English, by using a dictionary based on the multi-lingual aspects of the NLM’s Unified Medical Language System (UMLS) Metathesaurus. He also observed that, in addition to the NLM, other health-related federal agencies used the Web for the dissemination of free information, including the Centers for Disease Control and Prevention (CDC), the Food and Drug Administration (FDA), and the National Cancer Institute (NCI). Zacks and Hersh (1998) and Munoz and Hersh (1998), also working with W. Hersh, studied a variety of search strategies for retrieving medical-review articles from Web hypertext medical documents; they found great variation in the strategies’ sensitivity and specificity for accurately retrieving review articles on clinical diagnosis and therapy, and noted that the more complex strategies had higher accuracy rates. Price et al. (2002), also associated with W. Hersh, described developing Smart Query, which could provide context-sensitive links from the electronic medical record (EMR) to relevant medical-knowledge sources, and could help the clinician find answers to questions arising while using a patient’s EMR.

Cimino et al. (1990, 1994) reviewed some methods for information retrieval reported in the 1990s. Some were used to provide the retrieval of medical information from multiple sources, such as from clinical databases and from medical bibliographic resources; some included the use of NLM’s Unified Medical Language System (UMLS) for retrieving medical information by online bibliographic searches, and then integrating the information into clinical databases. They concluded that additional work was needed to: (a) better understand the information needs of different users in different settings; (b) satisfy those needs through more sophisticated selection and use of information resources; (c) translate concepts from clinical applications to information resources; and (d) better integrate the users’ systems. They noted that although early database-management systems allowed only their own data applications to be accessible from their own computer terminals, more advanced approaches sought to integrate outside information sources at the application level, so that patient data could be used for real-time literature retrieval, as when an abnormal laboratory test raised questions that could be answered by a search of the medical literature. In 1998 physicians at Vanderbilt University Hospital began to use their locally developed, computer-based, free-text summary-report system that facilitated the entry of a limited data summary report for the discharge or transfer of patients. They reported that two data sets were most commonly used for these summaries: (1) patients’ treatment items, which comprised summaries of clinical care, in addition to patient awareness and action items; and (2) care-coordination items, which included patients’ discharge and contact information, and any social concerns. They recommended formalizing and standardizing the various clinical-specialty data patterns to reduce the variability of the summary sign-out notes and to improve the communication of patient information (Campion et al. 2010). Zeng and Cimino (1999) evaluated the development of concept-oriented views of natural-language text in electronic medical records (EMRs). They also addressed the problem of “information overload” that often resulted when an excess of computer-generated, but unrelated, information was retrieved after clinical queries were entered when using EMRs. They compared the retrieval system’s ability to identify relevant patient data and generate either concept-oriented views or traditional clinical views of the original text; and they reported that concept-oriented views contained significantly less non-relevant information, and that when responding to queries about EMRs, concept-oriented views showed a significantly greater accuracy in relevant information retrieval.

In the 1990s C. Friedman, J. Cimino, G. Hripcsak and associates at Columbia University in New York reported developing a natural language processing (NLP) system for the automated encoding and retrieval of textual data that made extensive use of the Unified Medical Language System (UMLS) of the National Library of Medicine (NLM). Their model was based on the assumption that the majority of the information needs of users could be mapped to a finite number of general queries; the number of these generic queries was small enough to be managed by a computer-based system, but too large to be managed by humans. A large number of queries by clinical users were analyzed to establish common syntactic and semantic patterns; and the patterns were used to develop a set of general-purpose generic queries, which were then used for developing suitable responses to common, specific, clinical-information queries. When a user typed in a question, their computer program would match it to the most relevant generic query, or to a derived combination of queries. A relevant information resource was then automatically selected, and a response to the query was generated for presentation to the user. As an alternative, the user could directly select, from a list of all the generic queries in the system, one or more potentially relevant queries, and a response was then developed and presented to the user. Using the NLM’s UMLS Metathesaurus they developed a lexicon they called AQUA (A Query Analyzer), which used a conceptual-graph grammar that combined both syntax and semantics to translate a user’s natural-language query into conceptual-graph representations that were interpretations of the various portions of the user’s query; these could be combined to form a composite graph, which could then be parsed by a method that used the UMLS Semantic Net. Starting with identifying the semantic type that best represented the query, the parser looked for a word in a sentence of the given domain, for example “pathology”, that could be descended from this type; it then looked for semantic relations this word could have with other words in the sentence; and the algorithm then compiled a sublanguage text representing the response to the query (Cimino et al. 1993; Johnson et al. 1994, 1998).

Hripcsak et al. (1995) described developing a general-purpose NLP system for extracting clinical information from narrative reports. They compared the ability of their NLP system to identify any of six clinical conditions in the narrative reports of chest radiograms, and reported that the NLP system was comparable in its sensitivity and specificity to radiologists reading the reports. Hripcsak et al. (1996) reported that the codes in their database were defined in their vocabulary, the Medical Entities Dictionary (MED), which was based on a semantic network and served both to define codes and to map them to the codes used in the ancillary departments, such as the clinical laboratory codes. Hripcsak et al. (1996) also compared two query programs they used: (1) AccessMed, which used their Medical Entities Dictionary (MED) and its knowledge base in a hierarchical network, with links to defining attributes and values; the AccessMed browser looked up query terms by lexical matching of words that looked alike and by matching of synonyms, and it then provided links to related terms. (2) Query by Review, which used a knowledge base structured as a simple hierarchy, and provided a browser that allowed a user to move to the target terms by a series of menus. They compared the recall and precision rates of these two programs in gathering the vocabulary terms necessary to perform selected laboratory queries, and reported that Query by Review performed somewhat better than AccessMed, but that neither was adequate for clinical work.

Friedman (1994, 1995a, b, 1997) and associates at Columbia University in New York made substantial contributions to natural language processing (NLP) with the development of their Medical Language Extraction and Encoding (MedLEE) system, which became operational in 1995 at the Columbia-Presbyterian Medical Center (CPMC). Their NLP program was written in Prolog, could run on various platforms, and was developed at CPMC as a general-purpose NLP system. Friedman described the MedLEE system as composed of functionally different, modular components (or phases), in which each component in a series of steps processed the text and generated an output used by the subsequent component. The first component, the preprocessor, delineated the different sections in the report; separated the free-form textual data from any formatted data; used rules to determine word and sentence boundaries; resolved abbreviations; and performed a look-up in a lexicon to find the words and phrases in the sentences that were required for the subsequent parsing phase; it then generated an output consisting of lists of sentences and their corresponding lexical definitions. The parser phase then used the lexical definitions to determine the structure of each sentence, and the parser’s sentence grammar specified its syntactic and semantic structures. The phrase-regularization component then regularized the terms in the sentence and re-composed multi-word terms that had been separated; contiguous and non-contiguous lexical variants were then mapped to standard forms. The last phase, the encoder, then associated and mapped the regularized terms to controlled-vocabulary concepts by querying the synonym knowledge base in their Medical Entities Dictionary (MED) for compatible terms. MED served as their controlled vocabulary and was used in the automated mapping of medical vocabularies to the NLM’s Unified Medical Language System (Forman et al. 1995; Zeng and Cimino 1996). MED was their knowledge base of medical concepts, and consisted of taxonomic and other relevant semantic relations. After using MED’s synonym knowledge base, the regularized forms were translated into unique concepts, so that when the final structured forms of the processed reports were uploaded to their Medical Center’s centralized patient database, they corresponded to the unique concepts in their MED. The output of the structured encoded form was then suitable for further processing and interfacing, and could be structured in a variety of formats, including reproducing the original extracted data as it was before encoding, or presenting it in an XML output that, with markup, could highlight selected data. In their Medical Center the output was translated into an HL7 format and transferred into its relational medical database. All computer applications at their Medical Center could then reliably access the data by queries that used the structured form and the controlled vocabulary of their MED. Friedman et al. (1998a, b) described further development of the MedLEE system as one that analyzed the structure of an entire sentence by using a grammar that consisted of patterns of well-formed syntactic and semantic categories.
It processed sentences by defining each word and phrase in the sentence in accordance with their grammar program; it then segmented the entire sentence at certain types of words or phrases defined as classes of findings, which could include medical problems, laboratory tests, medications, and other terms consistent with their grammar; it then defined as modifiers, qualifiers, and values such items as the patient’s age, the body site, the test value, and other descriptors. For the first word or phrase in a segment that was associated with a primary finding identified in their grammar, an attempt was made to analyze the part of the segment starting with the left-most modifier (or value) of the primary finding; and this process was continued until a complete analysis of the segment was obtained. After a segment was successfully analyzed, the remaining segments in the sentence were processed by applying this same method to each segment; and the process of segmenting and analyzing was repeated until an analysis of every segment in each entire sentence was completed.
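The staged, modular design described above can be suggested by a highly simplified sketch (in Python; the lexicon, the stand-in parser, and the target codes are invented placeholders, not MedLEE’s actual resources): each phase consumes the output of the previous one.

```python
# Highly simplified, hypothetical sketch of a staged pipeline in the
# style described for MedLEE: preprocess, parse, then encode, with
# each phase consuming the previous phase's output.
LEXICON = {"chf": "congestive heart failure"}            # abbreviations
TARGET_CODES = {"congestive heart failure": "MED-0042"}  # invented codes

def preprocess(report):
    """Find sentence boundaries and resolve abbreviations."""
    sentences = [s.strip() for s in report.lower().split(".") if s.strip()]
    return [" ".join(LEXICON.get(w, w) for w in s.split()) for s in sentences]

def parse(sentence):
    """Stand-in parser: identify a primary finding and a negation."""
    for phrase in TARGET_CODES:
        if phrase in sentence:
            return {"finding": phrase, "negated": "no " in sentence}
    return {}

def encode(structure):
    """Map the regularized finding to a controlled-vocabulary code."""
    if structure:
        structure["code"] = TARGET_CODES[structure["finding"]]
    return structure

for sentence in preprocess("Patient has CHF. No pleural effusion."):
    print(encode(parse(sentence)))
```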

Friedman et al. (1998b) described some additional changes to the MedLEE system that allowed five modes of processing: (1) the initial segment included the entire sentence, and all words and multi-word phrases needed to be arranged into a well-formed pattern; (2) the sentence was then segmented at certain types of words or phrases, and the process was repeated until an analysis of each segment was obtained; (3) an attempt was made to identify a well-formed pattern for the largest prefix of the segment; (4) undefined words were skipped; and (5) the first word or phrase in the segment associated with a primary finding was identified, the left-most modifier of the finding was added, and the remaining portion was processed using the same method. The MedLEE system was initially applied in the radiology department, where radiologists were interpreting x-ray reports for about 1,000 patients per day. The radiologists dictated their reports, which were generally well structured and composed mostly of natural-language text. The dictated reports were transcribed and entered into their Radiology Information System, and then transferred into the clinical database of their CPMC Clinical Information System. The automated reports of 230 chest x-rays were randomly selected and checked by two physicians; they showed a recall rate of 70% and a precision of 87% for four specified medical conditions. In another evaluation of more than 3,000 sentences, 89% were parsed successfully for recall, and 98% were considered accurate based on the judgment of an independent medical expert. Hripcsak et al. (1995) further evaluated the performance of the MedLEE system for 200 patients who each had chest x-rays and any of six different medical diagnoses, and found that the NLP system’s final performance was the same as that of the radiologists. Friedman et al. (1996) reported extending a Web interface to MedLEE by using a Web browser, or by direct access for processing patients’ records using their Uniform Resource Locator (URL).

Cimino (1996) reviewed other automated information-retrieval systems that also used the NLM’s UMLS in some way, and proposed that additional work was needed to better understand the information needs of different users in different settings. Cimino (1996) also reviewed the evolution of methods to provide the retrieval of information from multiple sources, such as from both clinical databases and bibliographic sources. Initially, database-management systems allowed their own various data applications to be accessible from the same computer terminal or workstation. More advanced approaches then sought to integrate outside information sources at the application level, so that, for example, patient data could be used to drive literature-retrieval strategies, as when an abnormal laboratory test raised questions that could be answered by a search of the recent medical literature. Cimino also reviewed a variety of methods reported in the 1990s, some of which used the NLM’s Unified Medical Language System (UMLS) to retrieve medical information by online bibliographic searches for integration into clinical databases. Friedman and Hripcsak (1998a) published an analysis of methods used to evaluate the performance of medical NLP systems, and emphasized the difficulty of completing a reliable and accurate evaluation. They noted a need to establish a “gold reference standard”, and they defined 21 requirements for minimizing bias in such evaluations. Friedman and Hripcsak (1998b) also reported that most medical NLP systems could encode textual information as correctly as medical experts, since their reported sensitivity measures of 85% and specificity measures of 98% were not significantly different from those of the experts. Medical NLP systems that were based on the analysis of small segments of sentences, rather than on the analysis of the largest well-formed segment in a sentence, showed substantial increases in performance as measured by sensitivity, while incurring only a small loss in specificity. NLP systems that contained simpler pattern-matching algorithms using limited linguistic knowledge performed very well compared to those containing more complex linguistic knowledge.

The extension of MedLEE to a domain of knowledge other than radiology involved collecting a new body of training information. Johnson and Friedman (1996) noted that the NLP of discharge summaries in patients' medical records required adding demographic data, clinical diagnoses, medical procedures, prescribed medications with qualifiers such as dose, duration, and frequency, and clinical laboratory tests and their results; it also had to resolve conflicting data from multiple sources, add new single- and multi-word phrases, and represent all of these in an appropriate knowledge base. Barrows et al. (2000) also tested the application of the MedLEE system to a set of almost 13,000 notes for ophthalmology visits that were obtained from their clinical database. The notational text commonly used by the clinicians was full of abbreviations and symbols, and was poorly formed according to usual grammatical construction rules. After an analysis of these records, a glaucoma-dedicated parser was created using pattern matching of words and phrases representative of the clinical patterns sought. This glaucoma-dedicated parser was compared to MedLEE for the extraction of information related to glaucoma disease. They reported that the glaucoma-dedicated parser had a better recall rate than MedLEE, but MedLEE had better precision; however, the recall and the precision of both approaches were acceptable for their intended use. Friedman (2000) reported extending the MedLEE system for the automated encoding of clinical information in text reports into ICD-9, SNOMED, or UMLS codes.

Friedman et al. (2004) evaluated the recall and precision rates when MedLEE was used to automatically encode entire clinical documents to UMLS codes. For a randomly selected set of 150 sentences, MedLEE had recall and precision rates comparable to those of six clinical experts. Xu and Friedman (2003) and Friedman et al. (2004) described the steps they used with MedLEE for processing pathology reports for patients with cancer: (1) identify the information in each section, such as the section called specimen; (2) identify the findings needed for their research project; (3) analyze the sentences containing the findings, and then extend MedLEE's general schema to represent their structure; (4) adapt MedLEE so that it would recognize the new types of information, which were primarily genotypic concepts, and create new lexical entries; (5) develop a preprocessing program, to minimize the modifications to MedLEE, that transformed the reports into a format MedLEE could process more accurately, for example by linking findings to the appropriate specimen when a pathology report included multiple specimens; and (6) develop a post-processing program to transform the data into the format needed for a cancer registry. Cimino (2000) described a decade of use of MED for clinical applications of knowledge-based terminologies to all services in their medical center, including specialty subsystems.
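As a concrete picture of steps (5) and (6), the sketch below splits a multi-specimen report before extraction and flattens the structured output afterward into registry-style rows. The section labels, the extract() stand-in, and all other names are hypothetical; this is not MedLEE's actual input or output format.

```python
import re

def preprocess(report_text):
    """Split 'SPECIMEN A: ... SPECIMEN B: ...' into (specimen_id, text) pairs."""
    parts = re.split(r"SPECIMEN ([A-Z]):", report_text)
    # re.split with a capture group yields ['', 'A', text_a, 'B', text_b, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]

def extract(text):
    """Stand-in for the NLP engine: return toy structured findings."""
    findings = []
    if "carcinoma" in text.lower():
        findings.append({"finding": "carcinoma", "certainty": "positive"})
    return findings

def postprocess(patient_id, specimen_findings):
    """Flatten structured findings into flat cancer-registry rows."""
    rows = []
    for specimen, findings in specimen_findings:
        for f in findings:
            rows.append({"patient": patient_id, "specimen": specimen, **f})
    return rows

report = "SPECIMEN A: infiltrating ductal carcinoma. SPECIMEN B: benign tissue."
linked = [(spec, extract(text)) for spec, text in preprocess(report)]
print(postprocess("P001", linked))
```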

Cao et al. (2004) reported the application of the MedLEE system in a trial to generate a patient's problem list from the clinical discharge summaries that had been dictated by physicians for a set of nine patients, randomly selected from their hospital files. The discharge summary reports were parsed by the MedLEE system, and then transformed into text knowledge-representation structures in XML format that served as input to their problem-list generator. All the findings that belonged to the preselected semantic types were then extracted; these findings were weighted based on their frequency and semantic type; and a problem list was then prepared as output. A review by clinical experts found that for each patient the system captured more than 95% of the diagnoses, and more than 90% of the symptoms and findings associated with the diagnoses.
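A minimal sketch of the weighting step follows: extracted findings are scored by frequency and semantic type, and those scoring above a threshold survive into the problem list. The type weights and threshold here are invented for illustration, not the published values.

```python
from collections import Counter

# Hypothetical semantic-type weights; a diagnosis counts more than a symptom.
TYPE_WEIGHTS = {"diagnosis": 3.0, "symptom": 1.5, "finding": 1.0}

def build_problem_list(findings, threshold=1.5):
    """findings: list of (concept, semantic_type) tuples from the NLP output."""
    counts = Counter(concept for concept, _ in findings)
    types = dict(findings)  # last-seen semantic type per concept
    scored = {c: counts[c] * TYPE_WEIGHTS.get(types[c], 1.0) for c in counts}
    # Keep concepts above the threshold, highest score first.
    return sorted((c for c, s in scored.items() if s >= threshold),
                  key=lambda c: -scored[c])

findings = [("pneumonia", "diagnosis"), ("cough", "symptom"),
            ("pneumonia", "diagnosis"), ("rales", "finding")]
print(build_problem_list(findings))  # ['pneumonia', 'cough'] under these weights
```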

Bakken et al. (2004) reported the use of MedLEE for narrative nurses' reports, and compared the semantic categories of MedLEE with the semantic categories of the International Standards Organization (ISO) reference terminology models for nursing diagnoses and nursing actions. They found that all but two MedLEE diagnosis- and procedure-related semantic categories could be mapped to the ISO models; and they suggested areas for extension of MedLEE. Nielson and Wilson (2004) at the University of Utah reported developing an application built on MedLEE's parser, which at the time required sophisticated rules to interpret its structured output. MedLEE parsed a text document into a series of observations with associated modifiers and modifier values; the observations were then organized into sections corresponding to the sections of the document; the result was an XML document of observations linked to the corresponding text; and manual rules were written to parse the XML structure and to correlate the observations into meaningful clinical observations. Their application employed a rule engine developed by domain experts to automatically create rules for knowledge extraction from textual documents; it allowed the user to browse through the raw text of the parsed document and select phrases in the narrative text, and it then dynamically created rules to find the corresponding observations in the parsed document.

Zhou et al. (2006) used MedLEE to develop a medical terminology model for surgical pathology reports. They collected almost 900,000 surgical pathology reports that contained more than 104,000 unique terms, and that had two major patterns for reporting procedures, beginning with either "bodyloc" (body location) or "problem". They concluded that an NLP system like MedLEE provided an automated method for extracting semantic structures from a large body of free text, and reduced the burden on human developers of medical terminologies for medical domains. Chen et al. (2006) reported a modification of the structured output from MedLEE, from a nested structure into a simpler tabular format that was expected to be more suitable for some uses, such as spreadsheets.
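The nested-to-tabular conversion Chen et al. described can be pictured as follows. The XML shape below is made up, but the recursive flattening of nested modifiers into dotted column names conveys the idea of producing one flat row per finding.

```python
import xml.etree.ElementTree as ET

# Illustrative only: a MedLEE-style nested finding with nested modifiers.
doc = """<report>
  <finding v="infiltrate">
    <bodyloc v="lower lobe"><side v="left"/></bodyloc>
    <certainty v="high"/>
  </finding>
</report>"""

def flatten(elem, prefix=""):
    """Walk nested elements, emitting dotted column names for each value."""
    row = {}
    for child in elem:
        name = f"{prefix}{child.tag}"
        row[name] = child.get("v")
        row.update(flatten(child, prefix=name + "."))
    return row

root = ET.fromstring(doc)
rows = []
for f in root.findall("finding"):
    row = {"finding": f.get("v")}
    row.update(flatten(f))
    rows.append(row)
print(rows)
# [{'finding': 'infiltrate', 'bodyloc': 'lower lobe',
#   'bodyloc.side': 'left', 'certainty': 'high'}]
```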

Lussier et al. (2006) reported using the BioMedLEE system, an adaptation of MedLEE that focused on extracting and structuring biomedical entities and relations, including phenotypic and genotypic information, in the biomedical literature. They used it to automatically process text in order to map contextual phenotypes to the Gene Ontology Annotations (GOA) database, which facilitated semantic computations for the functions, cellular components, and processes of genes. Lussier described the PhenoGo system, which could automatically augment annotations in the GOA with additional context by using BioMedLEE and an additional knowledge-based organizer called PhenOS, in conjunction with MeSH indexing and established biomedical ontologies. PhenoGo was evaluated for coding anatomical and cellular information, and for assigning the coded phenotypes to the correct GOA, and was found to have a precision rate of 91% and a recall rate of 92%. Chen et al. (2008) also described using MedLEE and BioMedLEE to produce a set of primary findings (such as medical diagnoses, procedures, devices, and medications) with associated modifiers (such as body sites, changes, and frequencies). Since NLP systems had been used for knowledge acquisition because of their ability to rapidly and automatically extract medical entities and findings, relations, and modifiers within textual documents, they described their use of both NLP systems for mining textual data for drug-disease associations in MEDLINE articles and in patients' hospital discharge summaries. They focused on searching the textual data for eight diseases, representing a range of diseases and body sites, for any strong associations between these diseases and their prescribed drugs. BioMedLEE was used to encode entities and relations within the titles and abstracts of almost 82,000 MEDLINE articles, and MedLEE was used to extract clinical information from more than 48,000 discharge summaries. They compared the rates of specific drug-disease associations (such as levodopa for Parkinson's disease) found in both text sources; and concluded that the two text sources complemented each other, since the literature focused on testing therapies over relatively long time-spans, whereas discharge summaries focused on current practices of drug use. They also concluded that they had demonstrated the feasibility of the automated acquisition of medical knowledge from both the biomedical literature and patients' records. Wang et al. (2008) described using MedLEE to test for symptom-disease associations in the clinical narrative reports of a group of hospitalized patients; and reported an evaluation on a random sample of disease-symptom associations with an overall recall rate of 90% and a precision of 92%. Borlawsky et al. (2010) reviewed semantic-processing approaches to NLP for generating integrated data sets from published biomedical literature. They reported using BioMedLEE and a subset of the PhenoGo algorithms to extract encoded concepts, with a high degree of precision, and to determine relationships among a body of PubMed abstracts of the published cancer and genetics literature.
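At its core, this kind of mining reduces each document to the drugs and diseases it mentions and counts co-occurring pairs in each source. The sketch below uses naive substring matching as a stand-in for the NLP extraction step; the vocabularies and documents are illustrative only.

```python
from collections import Counter
from itertools import product

# Toy vocabularies standing in for the encoded entities an NLP engine produces.
DRUGS = {"levodopa", "metformin"}
DISEASES = {"parkinson's disease", "diabetes"}

def extract_pairs(text):
    """Stand-in for NLP extraction: naive substring matching."""
    low = text.lower()
    drugs = {d for d in DRUGS if d in low}
    diseases = {d for d in DISEASES if d in low}
    return set(product(drugs, diseases))

def count_associations(documents):
    """Count drug-disease pairs co-occurring within each document."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_pairs(doc))
    return counts

abstracts = ["Levodopa trial in Parkinson's disease ...",
             "Metformin outcomes in diabetes ..."]
summaries = ["Parkinson's disease managed with levodopa during admission."]
print(count_associations(abstracts))   # literature source
print(count_associations(summaries))   # clinical source, compared against it
```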

Hogarth (2000) and associates at the University of California, Davis, introduced the Terminology Query Language (TQL) as a query-language interface to server implementations of concept-oriented terminologies. They observed that terminology systems generally lacked standard methodologies for providing terminology support; TQL defined a query-based mechanism for accessing terminology information from one or more terminology servers over a network connection. They described TQL as a declarative language that specified what to get rather than how to get it, and as a query-language interface that was relatively easy to use and enabled simple extraction of terminology information from servers implementing concept-oriented terminology systems. They cited the Structured Query Language (SQL) for relational databases as a common example of another query-language interface (see Sect. 2.3). TQL allowed the data structures and names for terminology-specific data types to be mapped to an abstract set of structures with intuitively familiar names and behaviors. The TQL specification was based on a generic entity-relationship (E/R) schema for concept-based terminology systems. TQL provided a mechanism for operating on groups of "concepts" or "terms", traversing the information space defined by a particular concept-to-concept relationship, and extracting attributes for a particular entity in the terminology. TQL output was structured in XML, which provided a transfer format back to the system requesting the terminology information. Seol et al. (2001) noted that it was often difficult for users to express their information needs clearly enough to retrieve relevant information from a computer database system. They took an approach based on a knowledge base that contained patterns of information needs, and they provided conceptual guidance through a question-oriented interaction that integrated multiple query contexts (such as application, clinical, and document contexts), using a conceptual-graph model and the XML language. Mendonca et al. (2001) also reviewed NLP systems and examined the role that standardized terminologies could play in the integration of clinical systems with literature resources, as well as in the information-retrieval process. By helping clinicians to formulate well-structured clinical queries and to include relevant information from individual patients' medical records, they hoped to enhance information retrieval and improve patient care, developing a model that identified relevant information themes and added a framework of evidence-based practice guidelines.

With the advance of wireless communication, Lacson and Long (2006) described using mobile phones to capture, in natural language, time-stamped spoken dietary records from adult patients over a period of a few weeks. They classified the food items and the food quantifiers, and developed a dietary/nutrient knowledge base with added information from resources on food types, food preparation, food combinations, and portion sizes, and with dietary details from the dietary/nutrient resource database of 4,200 individual foods reported in the U.S. Department of Agriculture's Continuing Survey of Food Intakes by Individuals (CSFII). They then developed an algorithm to extract the dietary information from their patients' dietary records, and to automatically map selected items to their dietary/nutrient knowledge database. They reported a 90% accuracy in the automatic processing of the spoken dietary records.
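The mapping step might be sketched as follows, with a toy nutrient knowledge base standing in for the CSFII-derived database; the entries, quantifier handling, and portion values are all hypothetical.

```python
import re

# Hypothetical nutrient knowledge base; values are per standard portion.
NUTRIENTS = {
    "apple":  {"kcal": 95,  "portion_g": 182},
    "yogurt": {"kcal": 150, "portion_g": 245},
}
QUANTIFIERS = {"one": 1, "two": 2, "half": 0.5, "a": 1}

def parse_record(spoken):
    """Map 'two apples and a yogurt' to (food, multiplier) pairs."""
    items, qty = [], 1
    for token in re.findall(r"[a-z']+", spoken.lower()):
        if token in QUANTIFIERS:
            qty = QUANTIFIERS[token]
        else:
            food = token.rstrip("s")  # crude singularization, for illustration
            if food in NUTRIENTS:
                items.append((food, qty))
                qty = 1
    return items

def total_kcal(spoken):
    """Sum calories for all recognized items in one spoken record."""
    return sum(NUTRIENTS[f]["kcal"] * q for f, q in parse_record(spoken))

print(total_kcal("I ate two apples and a yogurt at noon"))  # 2*95 + 150 = 340
```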

Informatics for Integrating Biology and the Bedside (i2b2) was established in 2004 as a Center at the Partners HealthCare System in Boston, with the sponsorship of the NIH National Centers for Biomedical Computing; it was directed by Kohane, Glaser, and Churchill (https://www.i2b2.org). Murphy et al. (2007) described i2b2 as capable of serving a variety of clients by providing an interoperable framework of software modules, called the i2b2 Hive, to store, query, and retrieve very large groups of de-identified patient data; the Hive included a natural language processing (NLP) module. The i2b2 Hive used applications in units, called cells, which were managed by the i2b2 Workbench. The i2b2 Hive was an open-source software platform for managing medical-record data for purposes of research. Its architecture was based on loosely coupled, document-style Web services that researchers could use for their own data, with adequate safeguards to protect the confidentiality of patient data, which was stored in a relational database that could federate with other i2b2-compliant repositories. It thereby provided a very large, integrated data repository for studies of very large patient groups. The i2b2 Workbench consisted of a collection of users' "plug-ins" contained within a loosely coupled visual framework, in which independent plug-ins from various teams of developers could fit together. The plug-ins provided the means by which users interfaced with the other cells of the Hive; when a cell was developed, a plug-in could then be used to support its operations (Chueh and Murphy 2006). McCormick (2008) and associates at Columbia University in New York responded to an i2b2 team challenge to use patients' discharge summaries for testing automated classifiers of smoking status (current smoker, non-smoker, past smoker, or status unknown); they investigated the effect of semantic features extracted from clinical notes on classifying a patient's smoking status, and compared the performance of supervised classifiers to rule-based symbolic classifiers. They compared the performance of: (1) a symbolic rule-based classifier that relied on semantic features (generated by MedLEE); (2) a supervised classifier that relied on semantic features; and (3) a supervised classifier that relied only on lexical features. They concluded that classifiers with semantic features were superior to purely lexical approaches, and that the automated classification of a patient's smoking status was technically feasible and clinically useful.
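A toy version of the symbolic rule-based approach is sketched below: the input is a list of semantic features (a concept with negation and temporal modifiers) of the kind an NLP engine could supply, and a handful of rules assign one of the four smoking-status categories. The feature shape and the rules are invented for illustration, not taken from the published classifier.

```python
def classify_smoking(features):
    """features: list of dicts like
    {'concept': 'smoking', 'negated': False, 'past': False}.
    Returns one of the four challenge categories."""
    mentions = [f for f in features if f["concept"] == "smoking"]
    if not mentions:
        return "unknown"
    # A present-tense, non-negated mention outranks everything else.
    if any(not f["negated"] and not f["past"] for f in mentions):
        return "current smoker"
    if any(not f["negated"] and f["past"] for f in mentions):
        return "past smoker"
    return "non-smoker"

note = [{"concept": "smoking", "negated": False, "past": True},
        {"concept": "cough", "negated": False, "past": False}]
print(classify_smoking(note))  # 'past smoker'
```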

Himes (2008) and associates at Harvard Medical School and Partners HealthCare System reported using the i2b2 natural-language processing (NLP) program to extract both coded data and unstructured textual notes from more than 12,000 electronic patient records for research studies on patients with bronchial asthma. They found that the data extracted by this means were suitable for such research studies of large patient populations. Yang et al. (2009) used the i2b2 NLP programs to extract textual information from clinical discharge summaries, and to automatically identify the status of patients with a diagnosis of obesity and 15 related co-morbidities. They assembled a knowledge base with lexical, terminological, and semantic features to profile these diseases and their associated symptoms and treatments. They applied a data-mining approach to the discharge summaries of 507 patients, which combined knowledge-based lookup and rule-based methods; and reported a 97% accuracy in predictions of disease status, which was comparable to that of humans. Ware et al. (2009) also used the i2b2 NLP programs to focus on extracting diagnoses of obesity and 16 related diagnoses from textual discharge summary reports, and reported better than 90% agreement with clinical experts as the comparative "gold standard". Kementsietsidis (2009) and associates at the IBM T. J. Watson Research Center developed an algorithm to help when querying clinical records to identify patients with a defined set of medical conditions, called a "conditions-profile", that a patient was required to have in order to be eligible to participate in a clinical trial or a research study. They described the usual selection process as one of first querying the database to identify an initial pool of candidate patients whose medical conditions matched the conditions-profile, and then manually reviewing the medical records of each of these candidates to identify the most promising patients for the study. Since that first step could be complicated, and very time-consuming in a very large patient database if one used simple keyword searches for a large number of selection criteria in a conditions-profile, they developed an algorithm that identified compatibilities and incompatibilities between the conditions in the profile. Through a series of computational steps the program created a new conditions-profile, and returned to the researcher a smaller list of patients who satisfied the revised conditions-profile; this new list of patients could then be manually reviewed for those suited to the study.
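The automated screening step can be pictured as follows: a conditions-profile lists conditions a patient must have and conditions that disqualify, and only patients passing both tests reach manual review. This sketch omits the published algorithm's reasoning about compatibilities among the profile conditions themselves; the profile and patients are fabricated.

```python
def eligible(patient_conditions, required, excluded):
    """A patient passes if all required conditions are present and no
    excluded (incompatible) condition is present."""
    conds = set(patient_conditions)
    return required <= conds and not (excluded & conds)

profile_required = {"type 2 diabetes", "hypertension"}
profile_excluded = {"pregnancy", "renal failure"}

patients = {
    "P1": ["type 2 diabetes", "hypertension"],
    "P2": ["type 2 diabetes", "hypertension", "renal failure"],
    "P3": ["hypertension"],
}
shortlist = [pid for pid, conds in patients.items()
             if eligible(conds, profile_required, profile_excluded)]
print(shortlist)  # ['P1'] -- only P1 survives the automated screen
```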

Meystre and Haug (2003, 2005) at the University of Utah described their development of an NLP system to automatically analyze patients' longitudinal electronic medical records (EMRs), and to ease for clinicians the creation of a patient's medical-problem list. From the patients' problem-oriented medical records in their Intermountain Health Care program, they developed a problem list of about 60,000 concepts. Using this as a knowledge base, their Medical Problem Model identified and extracted, from the narrative text in an active patient's EMR, a list of potential medical problems. A Medical Document Model then used a problem-list management application to form a problem list that could be useful to the physician. In the Intermountain Health Care program, which used the HELP system, their objective was to use this NLP system to automate the development of problem lists, and to automatically update and maintain them for the longitudinal care of both ambulatory and hospital patients. Meystre et al. (2009) also reported installing and evaluating an i2b2 Hive for airway diseases including bronchial asthma, and reported that it was possible to query the structured data in patients' electronic records with the i2b2 Workbench for about half of the desired clinical data elements. Since smoking status was typically mentioned only in clinical notes, they used the NLP program in the i2b2 NLP cell, and found that the automated extraction of patients' smoking status had a mean sensitivity of 0.79 and a mean specificity of 0.90.

Childs et al. (2009) described using ClinREAD, a rule-based NLP system developed by Lockheed Martin, to participate in the i2b2 Obesity Challenge: to build software that could query and retrieve data from patients' clinical discharge summaries and judge whether the patients had, or did not have, obesity and any of 15 comorbidities (including asthma, coronary artery disease, diabetes, and others). They developed an algorithm with a comprehensive set of rules that defined word-patterns to be searched for in the text as literal text-strings (called "features"); these were grouped to form word-lists that were then matched against the text for the presence of any of the specified disease comorbidities. Fusaro (2010) and associates at Harvard Medical School reported transferring electronic medical records from more than 8,000 patients into an i2b2 database using Web services. Gainer (2010) and associates from Partners Healthcare System, Massachusetts General Hospital, Brigham and Women's Hospital, and the University of Utah described their methods for using i2b2 to help researchers query and analyze both coded and textual clinical data contained in electronic patient records. Using data from the records of patients with rheumatoid arthritis, the group of collaborating investigators was required to develop new concepts and methods to query and analyze the data, to add new vocabulary items and intermediate data-processing steps, and to do some custom programming.
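The feature/word-list mechanism can be sketched as follows: literal text strings are grouped into a word-list per comorbidity, and a discharge summary is judged positive for a comorbidity when any of its features matches. The lists below are illustrative, and the real rule sets also handled negation and uncertainty, which this sketch ignores.

```python
import re

# Illustrative word-lists: literal text strings ("features") grouped per
# comorbidity, in the style described above.
WORD_LISTS = {
    "obesity":  ["obese", "obesity", "bmi over 30"],
    "asthma":   ["asthma", "albuterol", "wheezing"],
    "diabetes": ["diabetes", "insulin", "metformin"],
}

def judge(summary):
    """Return a present/absent judgment per comorbidity for one summary."""
    low = summary.lower()
    return {disease: any(re.search(r"\b" + re.escape(f) + r"\b", low)
                         for f in features)
            for disease, features in WORD_LISTS.items()}

text = "Obese patient admitted with wheezing; started on insulin sliding scale."
print(judge(text))
# {'obesity': True, 'asthma': True, 'diabetes': True}
```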

Wynden (2010a, b) and associates at the University of California, San Francisco (UCSF) described their Integrated Data Repository (IDR) project that contained various collections of clinical, biomedical, economic, administrative, and public-health data. Since standard data-warehouse designs were usually difficult for researchers who needed access to a wide variety of data resources, they developed a translational infrastructure, called OntoMapper, that translated terminologies into formal data-encoding standards without altering the underlying source data, and also provided syntactic and semantic interoperability for grid-computing environments on the i2b2 platform; they thereby facilitated sharing data from different resources. Sim (2010) and associates from UCSF and several other medical institutions employed translational informatics and reported their collaboration in the Human Studies Database (HSDB) Project to develop semantic and data-sharing technologies to federate descriptions of human studies. Their priorities for sharing human-studies data included: (1) research characterization of populations, such as by outcome variables; (2) registration of studies into the database ClinicalTrials.gov; and (3) facilitating translational research collaborations. They used UCSF's OntoMapper to standardize data elements from the i2b2 data model; and they shared data using the National Cancer Institute's caGrid technologies. Zhang (2010) and associates at Case Western Reserve University and the University of Michigan developed a query interface for clinical research called the Visual Aggregator and Explorer (VISAGE), which incorporated three interrelated components: (1) a Query Builder with ontology-driven terminology support; (2) a Query Manager that stored and labeled queries for reuse and sharing; and (3) a Query Explorer for comparative analyses of query results. Together these components helped with efficient query construction, query sharing, and data exploration; and they reported that in their experience VISAGE was more efficient for query construction than the i2b2 Web client. Logan (2010) and associates at Oregon Health & Science University and Portland State University reviewed the use of graphical user interfaces to query a variety of multi-database systems, some using SQL or XML languages and others designed with an entity-attribute-value (EAV) schema. They reported using the Web Ontology Language (OWL) to query, select, and extract desired fields of data from these multiple data sources, and then to re-classify, re-modify, and re-use the data for their specific needs.

Translational informatics developed in the 2000s to support querying diverse information resources located in multiple institutions. The National Centers for Biomedical Computing (NCBC) developed technologies to address locating, querying, composing, combining, and mining biomedical resources; each site that intended to contribute to the inventory needed to transfer a biositemap that conformed to a defined schema and a standard set of metadata. Mirel (2010) and associates at the University of Michigan described using their Clinical and Translational Research Explorer project, with its Web-based browser, to facilitate searching and finding relevant resources for biomedical research. They were able to query more than 800 data resources from 38 institutions with Clinical and Translational Science Awards (CTSA) funding. Their project was funded by the NCBC, and was developed through a collaborative effort of ten institutions and 40 cross-disciplinary specialists. They defined a set of task-based objectives and user requirements to support users of their project. Denny (2010) and associates at Vanderbilt University developed an algorithm for phenome-wide association scans (PheWAS) to identify genetic associations in patients' electronic medical records (EMRs). Using International Classification of Diseases (ICD9) codes, they developed a code-translation table and automatically defined 776 different disease-population groups derived from their EMR data. They genotyped a group of 6,005 patients in the Vanderbilt DNA biobank at five single nucleotide polymorphisms (SNPs); these patients also had ICD9 codes for seven selected medical diagnoses (atrial fibrillation, coronary artery disease, carotid artery disease, Crohn's disease, multiple sclerosis, rheumatoid arthritis, and systemic lupus erythematosus), which were used to investigate SNP-disease associations. They reported that using the PheWAS algorithm, four of the seven known SNP-disease associations were replicated, and 19 previously unknown statistical associations between these SNPs and diseases were also identified at P < 0.01.
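The heart of a PheWAS scan is a simple association test repeated across many SNP-phenotype pairs. In the sketch below, each pair gets a 2x2 contingency test of risk-allele carrier status against case/control status, with hits reported at P < 0.01 as in the study; the SNP identifier and all counts are fabricated for illustration, and scipy is assumed to be available.

```python
from scipy.stats import chi2_contingency

def phewas_scan(tables, alpha=0.01):
    """tables: {(snp, phenotype): [[carrier_case, carrier_ctrl],
                                   [noncarrier_case, noncarrier_ctrl]]}.
    Returns the (snp, phenotype, p) triples significant at alpha."""
    hits = []
    for (snp, phenotype), table in tables.items():
        chi2, p, dof, expected = chi2_contingency(table)
        if p < alpha:
            hits.append((snp, phenotype, p))
    return hits

# Fabricated counts: carriers of the risk allele are enriched among cases
# for the first phenotype but not the second.
tables = {
    ("rs0000001", "rheumatoid arthritis"): [[120, 880], [60, 940]],
    ("rs0000001", "multiple sclerosis"):   [[90, 910], [85, 915]],
}
for snp, phe, p in phewas_scan(tables):
    print(f"{snp} ~ {phe}: P = {p:.2e}")
```

In practice the same test is run across hundreds of ICD9-derived case groups per SNP, which is what makes the scan "phenome-wide".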

In the 2000s the general public's use of Internet search engines increased, including use of NLM's PubMed and other NLM databases, by entering keywords or phrases for information about diseases and possible treatments. Google was very frequently queried, and it ranked websites based on their numbers of "hits". In a study of Google's effectiveness in searching for medical information, Wang et al. (2010) reported that its specificity was good, while its sensitivity for providing truly relevant responses might not always be satisfactory.

3.4 Summary and Commentary

In the 1950s most physicians recorded their hand-written notes on paper forms that were then collected and stored in their patients’ paper-based charts. Surgeons, pathologists, and some other clinicians dictated their reports that described the procedures they had performed on patients; and these reports were then transcribed by secretaries and deposited in the patients’ records. Since much of the data in a medical database was entered, stored, queried, and retrieved in natural language text, it was always evident that the processing of textual data was a critical requirement for a medical database.

In the 1960s some natural language processing (NLP) was performed by matching key words or phrases. In the 1970s NLP systems for automated text processing were primarily syntax-based programs that parsed the text by identifying words and phrases as subjects or predicates, and as nouns or verbs. After the syntax analysis was completed, semantic-based programs attempted to recognize the meanings of the words by referring to data dictionaries or knowledge bases, and used rewrite-rules to generate the text that had been represented by the stored codes.

By the 1980s NLP systems used both syntactic and semantic approaches, with knowledge bases that suggested how expert human parsers would interpret the meaning of words within their particular information contexts. In the 1990s NLP systems were able to provide both the automatic encoding of textual data and the retrieval of stored textual data. In the 2000s NLP systems were sufficiently developed to use convergent medical terminologies, to automatically encode textual data, and to successfully query natural language stored in medical databases.