BioReader: a text mining tool for performing classification of biomedical literature
Scientific data and research results are being published at an unprecedented rate. Many database curators and researchers use data and information from the primary literature to populate databases, form hypotheses, or serve as the basis for analyses or validation of results. These efforts largely rely on manual literature surveys to collect the data. Repositories such as PubMed make it possible to query the vast literature by keyword, but filtering the relevant articles from the query results can be a non-trivial and time-consuming task.
We here present a tool that enables users to classify scientific literature by text mining of article abstracts. BioReader (Biomedical Research Article Distiller) is trained by uploading article corpora for two training categories (e.g. one positive and one negative for content of interest), together with one corpus of abstracts to be classified and/or a search string to query PubMed for articles. The corpora are submitted as lists of PubMed IDs. The abstracts are automatically downloaded from PubMed and preprocessed, and the unclassified corpus is classified using the best performing of ten implemented classification algorithms.
BioReader supports data and information collection by implementing text mining-based classification of primary biomedical literature in a web interface, thus enabling curators and researchers to take advantage of the vast amounts of data and information in the published literature. BioReader outperforms existing tools with similar functionalities and expands the features used for mining literature in database curation efforts. The tool is freely available as a web service at http://www.cbs.dtu.dk/services/BioReader
Keywords
Database curation, Text mining, Machine learning, Biological databases, Literature survey, PubMed, Document classification
Abbreviations
DTM: Document term matrix
Tf-Idf: Term frequency-inverse document frequency
Article classification techniques thus facilitate systematic knowledge extraction from the entire corpus of biomedical literature. To enable the broader community to benefit from this workflow, we have implemented the relevant methods from text mining, machine learning, and bioinformatics in a web service for article classification and retrieval, which outperforms the simple keyword search functions native to PubMed, Google Scholar, etc. To illustrate the utility of BioReader in achieving a better and more fine-grained classification, we compared its performance against the closest resembling existing web service, MedlineRanker, and discuss a number of use cases for which we have utilized the method for database curation.
The webserver offers a simple interface where users are prompted to upload three lists of PubMed IDs: two lists for the training categories (e.g. positive and negative for content of interest) and one list of PubMed IDs corresponding to abstracts to be classified as belonging to one of the two groups. The abstracts are retrieved using NCBI’s Entrez programming utilities, E-utilities.
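As an illustration of the retrieval step, the following minimal sketch builds E-utilities EFetch request URLs for a list of PubMed IDs. It is not the BioReader implementation (which is written in R and Perl). The function name `build_efetch_urls`, the batch size, and the placeholder IDs are assumptions made for this example; the endpoint and the `db`, `id`, `rettype`, and `retmode` parameters are those of the public EFetch service.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_urls(pmids, batch_size=200):
    """Return one EFetch URL per batch of PubMed IDs (hypothetical helper)."""
    # Normalise the uploaded list: cast to text and drop empty entries.
    ids = [str(p).strip() for p in pmids if str(p).strip()]
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        query = urlencode({
            "db": "pubmed",          # query the PubMed database
            "id": ",".join(batch),   # comma-separated PubMed IDs
            "rettype": "abstract",   # request abstracts
            "retmode": "text",       # as plain text
        })
        urls.append(f"{EFETCH_URL}?{query}")
    return urls


# Placeholder IDs for illustration only, not real training data.
urls = build_efetch_urls(["111", "222", "333"], batch_size=2)
```

Each URL could then be fetched with `urllib.request.urlopen`. Batching keeps individual requests short, and the batch size of 200 is an arbitrary choice for the sketch.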
Text pre-processing and corpus formation
Once abstracts are retrieved, the three text corpora are generated and the following operations are performed on the text: lowercase transformation, stop word removal, punctuation removal, word stemming, and whitespace stripping. As many gene names contain numeric characters, numbers found in conjunction with letters are not removed. All of the above operations are performed using the “NLP” and “tm” packages for R.
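The pre-processing steps listed above can be sketched in a few lines. This is a simplified stand-in for the R “tm” pipeline, not a port of it: the stop word list is a tiny illustrative subset, `naive_stem` is a crude suffix stripper used in place of a real stemmer, and dropping standalone numbers is our reading of the statement that only numbers attached to letters are kept.

```python
import re

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on",
              "the", "to", "was", "were", "with"}


def naive_stem(token):
    """Crude suffix stripping, standing in for a proper stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    """Lowercase, strip punctuation and whitespace, drop stop words, stem."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes a space
    tokens = text.split()                  # split() also strips extra whitespace
    kept = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if tok.isdigit():                  # drop standalone numbers, keep e.g. "cd8"
            continue
        kept.append(naive_stem(tok))
    return kept


tokens = preprocess("The CD8 T-cells were  binding 25 epitopes.")
```

Tokens such as "cd8" survive because they mix letters and digits, which is what preserves gene and marker names.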
Document-term matrix formation and classifier training
After corpus formation, the texts are tokenized into document term matrices (DTM), which are essentially feature vectors of word counts for all words in all documents in the corpus. Word counts are background corrected by term frequency-inverse document frequency (Tf-Idf) transformation, which offsets the count of a given word by the number of documents in the corpus it occurs in, thereby reducing the importance of words that appear more frequently in general. Terms in the transformed DTMs are then reduced to the top terms differentiating the two training classes, as determined by a Mann-Whitney U test. The resulting training corpora DTMs are used to train and test ten different classification algorithms (support vector machine, elastic-net regularized generalized linear model, maximum entropy, scaled linear discriminant analysis, bagging, boosting, random forest, k-nearest neighbor, regression tree, and naïve Bayes classifiers) to accommodate corpora of different size and complexity. The best performing algorithm is determined by five-fold cross-validation on the training set, and the documents to be classified are subsequently assigned positive or negative for content of interest using this algorithm.
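To make the matrix construction and term selection concrete, here is a small self-contained sketch. It rests on several assumptions: the Tf-Idf variant shown (length-normalised term frequency times log2 of N over document frequency) is one common formulation and may differ in detail from the one used by BioReader; terms are ranked by how far the Mann-Whitney U statistic lies from its null expectation rather than by a p-value; and all function names and the toy documents are illustrative.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Count each vocabulary term in each tokenized document."""
    vocab = sorted({t for d in docs for t in d})
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def tf_idf(rows):
    """Weight counts by (count / doc length) * log2(N / document frequency)."""
    n_docs, n_terms = len(rows), len(rows[0])
    doc_freq = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    weighted = []
    for r in rows:
        total = sum(r) or 1  # guard against empty documents
        weighted.append([(r[j] / total) * math.log2(n_docs / doc_freq[j])
                         for j in range(n_terms)])
    return weighted


def mann_whitney_u(x, y):
    """U statistic of x versus y: pairs with x > y, ties counted as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def select_terms(vocab, weights, labels, top_n):
    """Rank terms by the distance of U from its null expectation n_x * n_y / 2."""
    scored = []
    for j, term in enumerate(vocab):
        pos = [w[j] for w, lab in zip(weights, labels) if lab == 1]
        neg = [w[j] for w, lab in zip(weights, labels) if lab == 0]
        half = len(pos) * len(neg) / 2
        scored.append((abs(mann_whitney_u(pos, neg) - half) / half, term))
    scored.sort(key=lambda s: -s[0])  # most discriminating terms first
    return [term for _, term in scored[:top_n]]


# Toy corpus of already pre-processed abstracts (illustrative only).
docs = [["epitop", "bind", "cell"], ["epitop", "peptid"],
        ["cell", "cultur"], ["cell", "growth", "cultur"]]
labels = [1, 1, 0, 0]  # 1 = positive class, 0 = negative class

vocab, counts = document_term_matrix(docs)
weights = tf_idf(counts)
top_terms = select_terms(vocab, weights, labels, top_n=2)
```

In this toy corpus, "cell" occurs in three of four documents and is therefore down-weighted, while terms confined to one class receive the highest separation score.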
The output consists of performance metrics from the five-fold cross-validation on the training data and two lists of article titles, corresponding to the classification of the test set. The input list is ranked by descending probability of abstracts falling within the two categories. In addition to the result lists, the top 50 terms with most differential frequency between the two training classes (25 for each class) are visualized by a word cloud, enabling users to refine their PubMed search term based on the terms in each class. The class separation is visualized in a PCA plot, with the newly classified articles highlighted.
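The ranked output lists can be illustrated as follows. This is a sketch under assumptions: the record layout (PubMed ID, title, probability of the positive class), the 0.5 decision threshold, and the function name `rank_predictions` are ours, not taken from BioReader.

```python
def rank_predictions(predictions, threshold=0.5):
    """Split classified articles into two lists, most confident first.

    predictions: list of (pmid, title, p_positive) tuples (hypothetical layout).
    """
    positive = [p for p in predictions if p[2] >= threshold]
    negative = [p for p in predictions if p[2] < threshold]
    # Positive list: highest probability of the positive class first.
    positive.sort(key=lambda p: -p[2])
    # Negative list: highest probability of the negative class first,
    # i.e. lowest probability of the positive class first.
    negative.sort(key=lambda p: p[2])
    return positive, negative


# Placeholder records for illustration only.
predictions = [
    ("111", "Article A", 0.62),
    ("222", "Article B", 0.08),
    ("333", "Article C", 0.97),
    ("444", "Article D", 0.41),
]
positive, negative = rank_predictions(predictions)
```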
Performance evaluation data
To evaluate the performance of BioReader, we used two curated abstract sets from the IEDB curation procedure. One corpus consists of 1000 abstracts of articles containing epitope-specific data or epitope structure as well as 1000 abstracts of articles that do not contain epitope-relevant data and information. The other corpus consists of 1000 abstracts of articles related to infectious diseases and 1000 abstracts related to non-infectious diseases (allergy, autoimmunity, cancer, etc.). Both corpora were randomly subdivided into sets of 1500 abstracts for training (including five-fold cross-validation and construction of learning curves) and 500 abstracts for performance evaluation.
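The evaluation design (a random 1500/500 split, followed by five-fold cross-validation on the training portion) can be sketched as below. The helper names, the fixed seed, and the use of integer placeholders for abstracts are assumptions. The split shown is a plain random shuffle without class stratification, since the text does not state whether stratification was used.

```python
import random


def train_test_split(items, n_train, seed=0):
    """Shuffle a copy of the items and cut it into training and held-out parts."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def k_folds(items, k=5):
    """Return k (training, validation) pairs; fold i validates on every k-th item."""
    folds = []
    for i in range(k):
        validation = items[i::k]
        training = [x for j, x in enumerate(items) if j % k != i]
        folds.append((training, validation))
    return folds


abstract_ids = list(range(2000))  # placeholders for 2000 curated abstracts
train, held_out = train_test_split(abstract_ids, n_train=1500)
folds = k_folds(train, k=5)       # each fold: 1200 for fitting, 300 for validation
```

Each candidate classifier would be fitted on the 1200-abstract part of every fold and scored on the remaining 300, and only the best performer would then be applied to the 500 held-out abstracts.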
Comparison to MedlineRanker
MedlineRanker enables users to input a single list of relevant literature, which is then used to rank publications from PubMed: either a randomly chosen subset, articles published within a date range, or a specific subset of articles. As an advanced option, MedlineRanker also enables classification based on two lists: 1) a list of articles of interest (positive list), and 2) a background list of irrelevant articles (negative list). We here compare the performance of BioReader to the advanced function of MedlineRanker.
Results and discussion
The performance of BioReader depends heavily on the size of the training set, how well the training set captures the differences between classes, and the inherent ability of a given set to be separated into the desired classes. Here we demonstrate that BioReader can successfully predict whether articles contain epitope-specific data or epitope structure and, from a separate corpus, which articles relate to infectious diseases vs. non-infectious diseases (allergy, autoimmunity, cancer, etc.).
Use case 1: Classifying articles for disease type and epitope content
For the epitope content example, the corpus of 2000 abstracts for which the articles were manually curated to be positive for epitope content was subsequently manually classified for infectious disease vs. non-infectious disease content. In this example, glmnet also proved to be superior in five-fold cross-validation on 1500 abstracts, and the learning curve (Additional file 2) indicated that a training set of around 600 abstracts (300 in each category) resulted in near-optimal performance. Training on the full training corpus and subsequent testing on the 500 abstracts excluded from the initial training yielded an AUC of 0.953, with specificity, sensitivity, and accuracy of 0.941, 0.854, and 0.898, respectively.
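For readers unfamiliar with the reported metrics, the sketch below shows how specificity, sensitivity, accuracy, and AUC are conventionally computed from held-out labels and classifier scores. The function names, the 0.5 threshold, and the toy data are assumptions for illustration; the numbers are not the BioReader results.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from 0/1 labels and predictions.

    Assumes both classes are present, otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
        "accuracy": (tp + tn) / len(pairs),
    }


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy held-out set (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = binary_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Unlike the other three metrics, AUC does not depend on a decision threshold, which is why it is reported separately.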
Use case 2: Classifying articles for surface protein expression data
Throughout the history of molecular biology researchers have been accumulating information about cells, including their functions, molecular composition, development from stem cells, and role in disease. Many of these studies rely on immunophenotyping using molecular surface markers to distinguish cells, diseases, or developmental stages of interest. The dynamic surface marker profiles of cells have been extensively used as biomarkers indicative of different biological states (e.g. developmental stage, disease state, etc.), for cell sorting, and for therapeutics, where specific surface markers are used to direct therapeutic agents to diseased cells, using either monoclonal antibodies or cell-based therapies. Traditionally, studies revealing new knowledge about cells, their surface markers, and the complex dynamic relationship between the two have been communicated and shared almost exclusively in the primary scientific literature.
We utilized BioReader and manual data extraction to assemble a comprehensive data set of human hematopoietic cells and their corresponding quantitative or qualitative presence (depending on availability) of known molecular surface markers. Utilizing over 6000 data points across 305 CD molecules on 206 cell types, we characterized the “human hematopoietic CDome” and found that surface markers provided a higher resolution functional classification of hematopoietic cellular function than transcriptome-wide expression analyses.
Feature comparison of BioReader, MedlineRanker, and MScanner
Positive class input
Negative class input
Classification list input
All words (stemmed to consolidate counts), MeSH, journal, authors
support vector machine, elastic-net regularized generalized linear model, maximum entropy, supervised latent Dirichlet allocation, bagging, boosting, random forest, k-nearest neighbor, regression tree, and naïve Bayes classifiers
Naïve Bayes classifier
Naïve Bayes classifier
Ranked lists, term signature (positive and negative), separation visualization (PCA), performance metrics
Ranked lists, term signature (positive), performance metrics
Standalone source code available
No (but offers API)
We have created a flexible implementation of a number of well-known and established text mining tools, designed to cater to a variety of classification tasks with biomedical literature. We have demonstrated that, with a relatively small set of manually categorized articles, users can classify up to 1000 PubMed articles per run, with no limit on the number of runs. BioReader outperforms existing tools for classification tasks and offers new and improved features.
Availability and requirements
Project name: BioReader
Project home page: http://www.cbs.dtu.dk/services/BioReader
Operating system(s): Platform independent
Programming language: R, Perl
Other requirements: None
License: GNU GPL.
Any restrictions to use by non-academics: License needed.
This work and publication costs are funded by The Lundbeck Foundation [grant R181–2014-3761].
Availability of data and materials
Our web server is freely available at http://www.cbs.dtu.dk/services/BioReader and the source code is available at https://bitbucket.org/lronn/bioreader_standalone. Additional information about methodology, usage optimization, example workflows, and example data is available at http://www.cbs.dtu.dk/services/BioReader/instructions.php
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 13, 2018: 17th International Conference on Bioinformatics (InCoB 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-13.
The tool was conceptualized by LRO, CS, and MBB. LRO wrote the source code. KD set up the webserver. CH, MBB, and ES evaluated performance and performed comparison to other tools. LRO and CH wrote the manuscript. All authors read and approved the manuscript.
The authors declare that they have no competing interests.
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.