MScanner: a classifier for retrieving Medline citations
- 8.6k Downloads
Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains.
MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92.
MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.
KeywordsCross Validation MeSH Area Under Curve Control Corpus Trained Classifier
Ad-hoc information retrieval
Information retrieval on the biomedical literature indexed by Medline  is most often carried out using ad-hoc retrieval. The PubMed  boolean search engine is the most widely used Medline retrieval system. Other interfaces to searching Medline include relevance ranking systems such as Relemed  and systems such as EBIMed  that perform information extraction and clustering on results. Certain web search engines such as Google Scholar  also index much of the same literature as Medline. Alternatives to ordinary queries include the related articles feature of PubMed , which returns the Medline records most similar to a given record of interest, and the eTBlast  search engine which ranks Medline abstracts by their similarity to a given paragraph of text.
Supervised learning for database curation
Ad-hoc retrieval in general has proven inefficient for the task of identifying articles relevant to databases that require manual curation of entries from biomedical literature, such as the Pharmacogenetics Knowledgebase (PharmGKB) , and for constructing corpora for automated text mining systems such as Textpresso [9, 10]. It is difficult to design an expert boolean query (the knowledge engineering approach to document classification ) that recalls most of the relevant documents without retrieving many irrelevant documents at the same time, when there are many document features that potentially indicate relevance.
The case of many relevant features is, however, effectively handled using supervised learning, in which a text classifier is inductively trained from labelled examples [12, 13]. Several databases have therefore used supervised learning to filter Medline for relevant documents , a recent example being the Immune Epitope Database (IEDB) . IEDB researchers first used a sensitive PubMed query several pages in length to obtain a Medline subset of 20,910 records. The components of the query had previously been used by IEDB curators, whose manual relevance judgements formed a "gold standard" training corpus of 5,712 relevant and 15,198 irrelevant documents. Different classifier algorithms and document representations were evaluated under cross validation, and their performance compared using the area under the Receiver Operating Characteristic (ROC) curve . The best of the trained classifiers is to be applied to future results of the sensitive query to reduce the number of documents that have to be manually reviewed.
Supervised learning has also been used to identify Medline records relevant to the Biomolecular Interaction Network Database , the ACP Journal Club for evidence based medicine , the Textpresso resource , and the Database of Interacting Proteins (DIP) . Classification may also be performed on full-text articles as in the TREC 2005 Genomics Track , and Cohen  provides a general-purpose classifier for the task. Most classifiers have been developed for filtering sets of a few thousand Medline records, but it is possible to classify larger subsets of Medline and even the whole Medline database. A small number of methods have been developed for larger data sets, including an ad-hoc scoring method that has been tested on a stem cell subset of Medline , the PharmGKB curation filter , and the PubFinder  web application derived from the DIP curation filter . However, tasks submitted to the PubFinder site in mid-2006 are still processing and the maintainers are unreachable. In some cases, text mining for relationships between named entities is used instead of supervised learning to judge relevance – for example in the more recent curation filter developed for the DIP . The most closely related articles  to individual articles in a collection have also been used to update a bibliography  or a database .
Comparison of information retrieval approaches
Approaches to retrieving relevant Medline records for database curation have included ad-hoc retrieval (boolean retrieval in particular), related article search, and supervised learning. Pure boolean retrieval systems like PubMed return (without ranking) all documents that satisfy the logical conditions specified in the query. The vector space models used by web search engines rank documents by similarity to the query, and probabilistic retrieval models rank documents by decreasing probability of relevance to the topics in the query . Related article search retrieves documents by their similarity to a query document, which can be accomplished by using the document as a query string in a ranking ad-hoc retrieval system tuned for long queries [6, 29]. Overlap in citation lists has also been used as a benchmark for relatedness . The method used in PubMed related articles  directly evaluates the similarity between a pair of documents over all topics (corresponding to vocabulary terms) using a probabilistic model. Supervised learning trains a document classifier from labelled examples, framing the problem of Medline retrieval as a problem of classifying documents into the categories of "relevant" and "irrelevant". Classifiers may either produce ranked outputs or make hard judgements like a boolean query . Statistical classifiers, such as the Naïve Bayes classifier used here, use the same Probability Ranking Principle as probabilistic ad-hoc retrieval systems . Ranked classifier results may loosely be considered to contain documents closely related to the relevant examples as a whole.
Overview of MScanner
We have developed MScanner, a classifier of Medline records that uses supervised learning to identify relevant records in a non-domain-specific manner. The user provides only relevant citations as training examples, with the rest of Medline approximating the irrelevant examples for training purposes. Most classifiers are developed for particular databases, a limitation that we address by demonstrating effectiveness in multiple domains and providing facilities to evaluate the classifier on new inputs. We make it easier to use text classification by providing a web interface and operating on all of Medline instead of a Medline subset. To attain the high speeds necessary for online use, we used an optimised implementation of a Naïve Bayes classifier, and a compact document representation derived from two feature spaces in the Medline record metadata, namely the Medical Subject Headings (MeSH) and the journal of publication (ISSN). The choice of the MeSH feature space is informed by a previous study , in which classification using MeSH features performed well on PharmGKB citations. We describe the use of the classifier, present example cross validation results, and evaluate the classifier on a gold standard data set derived from an expert PubMed query.
Web interface workflow
Feature scores for PG07.
p(F i = 1|R)
p(F i = 1|)
1744–6872 (Pharmacogenet. Genomics)
1470-269X (Pharmacogenomics J.)
Organic Anion Transport Polypeptide C
Cytochrome P-450 CYP2D6
Xeroderma Pigmentosum Group D Protein
Organic Anion Transporters, Sodium-Independent
The submission form allows some of the classifier parameters to be adjusted. These include setting an upper limit on the number of results, or restricting Medline to records completed after a particular date (useful when monitoring for new results). More specialised options include the estimated fraction of relevant articles in Medline (prevalence), and the minimum score to classify an article as relevant. Higher estimated prevalence produces more results by raising the prior probability of relevance (see Methods), while higher prediction thresholds return fewer results, for greater overall precision at the cost of recall.
Cross validation protocol
The web interface provides a 10-fold cross validation function. The input examples are used as the relevant corpus, and up to 100,000 PubMed IDs are selected at random from the remainder of Medline to approximate an irrelevant corpus. In each round of cross validation, 90% of the data is used to estimate term frequencies, and the trained classifier is used to calculate article scores for the remaining 10%. Graphs derived from the cross validated scores include article score distributions, the ROC curve  and the curve of precision as a function of recall. Metrics include area under ROC and average precision .
Cross validation statistics.
ROC Std Error
Distributions of article scores
Receiver Operating Characteristic
Precision under cross validation
Performance in a retrieval situation
To evaluate classification performance in a retrieval situation we compared the performance of MScanner to the performance of an expert PubMed query that was used to identify articles for the Immune Epitope Database (IEDB). We made use of the 20,910 results of a sensitive expert query that had been manually split into 5,712 relevant and 15,198 irrelevant articles for the purpose of training the IEDB classifier . MeSH terms were available for 20,812 of the articles, of which 5,680 were relevant and 15,132 irrelevant. The final data set is provided in Additional File 2. To create training and testing corpora, we first restricted Medline to the 783,028 records completed in 2004, a year within the date ranges of all components of the IEDB query. For relevant training examples we used the 3,488 relevant IEDB results from before 2004, and we approximated irrelevant training examples using the whole of 2004 Medline. We then used the trained classifier to rank the articles in 2004 Medline.
We also compared MScanner to the IEDB classifier on its cross validation data, to evaluate the trade-off between performance and speed. The IEDB uses a Naïve Bayes classifier with word features derived from a concatenation of abstract, authors, title, journal and MeSH, followed by an information gain feature selection step and extraction of domain-specific features (peptides and MHC alleles). Using cross-validation to calculate scores for the collection of 20,910 documents, the IEDB classifier obtained an area under ROC curve of 0.855, with a classification speed (after training) of 1,000 articles per 30 seconds. MScanner, using whole MeSH terms and ISSN features, obtained an area under ROC of 0.782 ± 0.003, with a classification speed of approximately 15 million articles per 30 seconds. However, the prior we used for frequency of term occurrence (see Methods) is designed for training data where the prevalence of relevant examples is low. The prevalence of 0.27 in the IEDB data is much higher than the prevalences in Table 2, and using the Laplace prior here would improve the ROC area to 0.825 ± 0.003 but degrade performance in cross validation against Medline100K. The remaining difference in ROC between MScanner and the IEDB classifier reflects information from the abstract and domain-specific features not captured by the MeSH feature space. All ROC AUC values on the IEDB data are much lower than in the sample cross validation topics. This is because it is more difficult to distinguish between relevant and irrelevant articles among the closely related articles resulting from an expert query, than to distinguish relevant articles from the rest of Medline.
Uses of supervised learning for Medline retrieval
Supervised learning has already been applied to the problem of database curation and the development of text mining resources. However, using a web service like MScanner to perform supervised learning is a simple operation compared to constructing a boolean filter, gold standard training set, and custom-built classifier. MScanner may supplement existing workflows that use a pre-filter query by detecting relevant articles inadvertently excluded by the filter. Another possibility is using MScanner in place of a filter query when one is unavailable, and confirming relevance by passing on the results to a stronger classifier or an information extraction method such as that used by the Database of Interacting Proteins . Supervised learning can also be used in other scenarios where relevant training examples are readily available and the presence of many relevant features hinders ad-hoc retrieval. For example, individual researchers could leverage the documents in a personal bibliography to identify additional articles relevant to their research interests.
MScanner's performance varies by topic, depending on the degree to which features are enriched or depleted in relevant articles compared to Medline. The relative performance on different corpora also depends on the evaluation metric used. For example, ROC performance on PG07 shows lower overall ability to distinguish pharmacogenetics articles from Medline, but the right hand sub-plot of Figure 4 shows higher recall at low false positive rates on PG07 than AIDSBio. Besides the topic itself, the size of the training set can also influence performance. For the complex topics curated by databases, many relevant examples may be needed to obtain good coverage of terms indicating relevance. Narrower topics, such as the Radiology corpus, require fewer training examples to obtain good estimates of the frequencies of important terms. Too few training examples, however, will result in poor estimates of term frequencies (over-estimates, or failure of important terms to be represented), degrading performance. The use of a random set of Medline articles as the set of irrelevant articles in training (Medline100K in the use cases we presented) can also influence performance in cross validation. It can inflate the false positive rate to some extent because it contains relevant articles that are not part of the relevant training set.
The score distributions for the Control corpus (Figure 3) were somewhat anomalous, with multiple narrow modes. This is due to the larger irrelevant corpus derived from Medline100K containing low-frequency features not present in the Control corpus. Each iteration of training therefore yielded many rare features with scores around -8 to -10. The four narrow peaks correspond to the chance presence of 0, 1, 2 or 3 of those features, which were influential because other features scored close to zero. In non-random corpora (AIDSBio, PG07 and Radiology), the other non-zero features dominate to produce broader unimodal distributions. Removing features unique to Medline100K reduced the Control distribution to the expected single narrow peak between -5 and +5.
We represented Medline records as binary feature vectors derived from MeSH terms and journal ISSNs. These are separate feature spaces: a MeSH term and ISSN consisting of the same string would be not be considered the same feature. Medline provides each MeSH term in a record as a descriptor in association with zero or more qualifiers, as in "Nevirapine/administration & dosage". To reduce the dimensionality of the feature space we treat the descriptor and qualifier as separate features. We detected 24,069 distinct MeSH features in use, and 17,191 ISSN features, for an average of 13.5 features per record. The 2007 MeSH vocabulary comprises 24,357 descriptors and 83 qualifiers. Of the journals, about 5,000 are monitored by PubMed and the rest are represented by a only few records each. An advantage of the MeSH and ISSN feature spaces is that they allow a compact document representation using 16-bit features, which increases classification speed. MeSH is also a controlled vocabulary, and so does not have word sense ambiguities like free text. However the vocabulary does not cover all concepts, and covers some areas of biology and medicine (such as medical terminology) more densely than others. Also, not every article has all relevant MeSH terms assigned, and there is a tendency for certain terms to be assigned to articles that just discuss the topic, such as articles "about dental research" rather than dental research articles themselves .
Performance can be improved by adding an additional space of binary features derived from the title and abstract of the document. Not relying solely on MeSH features would also enable classification of Medline records that have not been assigned MeSH descriptors yet. The additional features would, however, reduce classification speed due to larger document representations, introduce redundancy with the MeSH feature space, and require a feature selection step. The IEDB classifier  avoids redundancy by concatenating the abstract with the MeSH terms and using a single feature space of text words. Binary features should model short abstracts relatively well, although performance on longer texts is known to benefit from considering multiple occurrences of terms [35, 36].
MeSH annotations and journal ISSNs are domain-specific resources in the biomedical literature. The articles cited by a given article (although not provided in Medline) are another domain-specific resource that may prove useful in retrieval tasks, in addition to their uses in navigating the citation network. For example, the overlap in citation lists has been used as a benchmark for article relatedness . In supervised learning, it may be possible to incorporate the number of co-citations between a document and relevant articles, or to use the citing of an article as a binary feature.
MScanner inductively learns topics of interest from example citations, with the aim of retrieving a large number of topical citations more effectively than with boolean queries. It represents an advance on previous tools for Medline classification by performing well across a range of topics and input sizes, by making available implementation source code, and by operating on all of Medline fast enough to use over a web interface. As a non-domain-specific classifier, it has a facility for performing cross validation to obtain ROC and precision statistics on new inputs. MScanner should be useful as a filter for database curation where a sensitive filter query and customised classifier are not already available, and in general for constructing large bibliographies, text mining corpora and other domain-specific Medline subsets.
The greatest support scores for occurring features are shown in Table 1, when the classifier has been trained to perform PG07 retrieval. For computational efficiency, the non-occurrence support scores, Y(F i = 0), are simplified to a base score (of an article with no features) and a small adjustment for each feature that occurs.
We estimate the prior probability of relevance P(R) using the number of training examples divided by the number of articles in Medline, and the classifier predicts relevance for articles with S(f) ≥ 0. The prior and minimum score for predicting relevance may also be set on the web interface.
Estimation of feature frequencies
And similarly for p(F i = 1|). Probabilities for non-occurrence of features are of the form p(F i = 0|R) = 1 - p(F i = 1|R). Bayesian classifiers normally use a Laplace prior , which specifies one prior success and one prior failure for each feature. However, the Laplace prior performs poorly here because of class skew in the training data: when irrelevant articles greatly outnumber relevant ones it over-estimates P(F i = 1|R) relative to P(F i = 1|), in particular for terms not observed in any relevant examples.
Data structures enabling fast classification
MScanner's classification speed is due to the use of a Bayesian classifier, a compact feature space, and a customised implementation. Training in retrieval tasks is made much faster by keeping track of the total number of occurrences of each term in Medline. The MeSH and ISSN feature spaces fit in 16-bit feature IDs, and each Medline record has an average of 13.5 features. Including some overhead, this allows the features of all 16 million articles in Medline to be stored in a binary stream of around 600 MB. A C program takes 32 seconds to parse this file and calculate article scores for all of Medline, returning those above the specified threshold. The rest of the program is written in Python , using the Numpy library for vector operations. Source code is provided in Additional File 3.
For storing complete Medline records, we used a 22 GB Berkeley DB indexed by PubMed ID. It was generated by parsing the Medline Baseline  distribution, which consists of 70 GB XML compressed to 7 GB and split into files of 30,000 records each. During parsing, a count of the number of occurrences of each feature in Medline is maintained, ready to be used for training the classifier. To look up feature vectors in cross validation, we use a 1.3 GB Berkeley DB instead of the binary stream.
Construction of PG07, AIDSBio, Radiology and Medline100K
The PG07, AIDSBio and Radiology corpora provided in Additional File 4 are from different domains and are of different sizes, to illustrate the different use cases mentioned in the results. The PG07 corpus comprises literature annotations taken from the PharmGKB  on 5 February 2007. The AIDSBio corpus is the intersection of the PubMed AIDS  and Bioethics  subsets on 19 October 2006. The Radiology corpus is a bibliography of 67 radiology articles focusing on the spleen, obtained from a co-worker of DR's. The corpora exclude records that do not have status "MEDLINE", and thus lack MeSH terms. The Medline100K corpus consists of 100,000 randomly selected Medline records, with completion dates up to 21 January 2007, which is also the upper date for the Control corpus of 10,000 random citations. The size of Medline100K was chosen to provide a good approximation of the Medline background, while containing few unknown relevant articles.
Availability and requirements
Project Name: MScanner
Home Page: http://mscanner.stanford.edu
Operating Systems: Platform independent
Minimum Requirements: Internet Explorer 7, Mozilla Firefox 2, Opera 9, or Safari 3
License: GNU General Public License
This work is supported by the University of Cape Town (UCT), the South African National Research Foundation (NRF), the National Bioinformatics Network (NBN), and the Stanford-South Africa Bio-Medical Informatics Programme (SSABMI), which is funded through US National Institutes of Health Fogarty International Center Grant D43 TW06993, and PharmGKB associates by grant NIH U01GM61374. Thank you to Tina Zhou for setting up the server space for MScanner, and Prof. Vladimir Bajic for a helpful discussion.
- 1.Fact Sheet: MEDLINE[http://www.nlm.nih.gov/pubs/factsheets/medline.html]
- 2.Fact Sheet: PubMed®: MEDLINE®R Retrieval on the World Wide Web[http://www.nlm.nih.gov/pubs/factsheets/pubmed.html]
- 5.Google Scholar[http://scholar.google.com]
- 11.Sebastiani F: A Tutorial on Automated Text Categorisation. In Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence Edited by: Amandi A, Zunino R, Buenos Aires AR. 1999, 7–35.Google Scholar
- 17.Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K, Pawson T, Hogue CWV: PreBIND and Textomy-mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4: 11.PubMedCentralCrossRefPubMedGoogle Scholar
- 20.Hersh W, Cohen A, Yang J, Bhupatiraju R, Roberts P, Hearst M: TREC 2005 Genomics Track Overview. The Fourteenth Text REtrieval Conference (TREC 2005) 2005.Google Scholar
- 21.Cohen AM: An effective general purpose approach for automated biomedical document classification. AMIA Annu Symp Proc 2006, 161–165.Google Scholar
- 24.Goetz T, von der Lieth CW: PubFinder: a tool for improving retrieval rate of relevant PubMed abstracts. Nucleic Acids Res 2005, (33 Web Server):W774-W778.Google Scholar
- 26.Liu X, Altman RB: Updating a bibliography using the related articles function within PubMed. Proc AMIA Symp 1998, 750–754.Google Scholar
- 33.Davis J, Goadrich M: The relationship between Precision-Recall and ROC curves. In ICML 2006: Proceedings of the 23rd International Conference on Machine learning. New York, NY, USA: ACM Press; 2006:233–240.Google Scholar
- 35.McCallum A, Nigam K: A comparison of event models for Naive Bayes text classification. Tech. rep., Just Research 1998.Google Scholar
- 39.2007 MEDLINE®R/PubMed®R Baseline Distribution[http://www.nlm.nih.gov/bsd/licensee/2007_stats/baseline_doc.html]
- 40.National Library of Medicine AIDS Subset Strategy[http://www.nlm.nih.gov/bsd/pubmed_subsets/aids_strategy.html]
- 41.National Library of Medicine Bioethics Subset Strategy[http://www.nlm.nih.gov/bsd/pubmed_subsets/bioethics_strategy.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.