A statistical approach to identify, monitor, and manage incomplete curated data sets
Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation and for avoiding flawed hypothesis generation, and it can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here.
In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database having incomplete curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. One hundred and twenty-two and 165 of the 7387 tested genes were identified as missing expression annotations based on their residuals being outside the model lower or upper 95% confidence interval respectively. The model had precision of 0.97 and recall of 0.71 at the negative 95% confidence interval and precision of 0.76 and recall of 0.73 at the positive 95% confidence interval.
This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information.
Keywords: Zebrafish, Danio rerio, Gene expression, Machine learning, Curation
In recent years, the biological sciences have benefited immensely from new technologies and methods in both biological research and computer sciences. Together these advances have produced a surge of new data. Biological research now relies heavily upon expertly curated database resources for rapid assessment of current knowledge on many topics. Management, organization, standardization, quality control, and crosslinking of data are among the important tasks these resources provide. It is commonplace today for these data to be widely shared and combined, increasing the impact that incomplete or incorrect data may have on downstream data consumers. Although assessing how complete or correct a large data set may be remains a challenge, several approaches have been reported. These include computational methods for identifying data updates and artifacts that may be of interest to downstream data consumers, machine learning methods to identify incorrectly classified G-protein coupled receptors [2] and to improve the quality of large data sets prior to quantitative structure-activity relationship modeling. The completeness and quality of curated nanomaterial data have also been explored [4].
What does it mean for a data set to be “complete” or “incomplete”? Data can be incomplete in two ways: missing values for variables, or missing entire records which could be included in a data set. Handling missing variable values in statistical analyses is a complex topic outside the scope of this article. In the context of this work, “complete” means all currently published data of a specific type are present in the data set with no missing values for any variables. In this study, data from the ZFIN Database have been used to find genes that have an incomplete gene expression data set, that is, genes for which there exist published but not yet curated gene expression data.
There are several reasons data repositories may not include all relevant published data, including high data volume, selective partial curation, delays in data access, and release of data prior to the ability to curate it. High data volume can result in the need for prioritization of the incoming data stream. For example, ZFIN is the central data repository for expertly curated genetic and genomic data generated using the zebrafish (Danio rerio) as a model system [5]. One major data input to ZFIN is the published scientific literature. A search of PubMed for all zebrafish literature shows that this corpus has consistently increased in volume by 10% every year since 1996, resulting in a greater than 10-fold increase in the number of publications processed by ZFIN in 2016 (2865 publications) compared to 1996. Such increases necessitate prioritization to focus effort on the data deemed most valuable by the research community. As a result, curation of some publications is delayed or prevented altogether. ZFIN currently includes curated data from approximately 25% of the incoming literature within 6 months of publication.
Data sets can also be incomplete relative to what has been published if publications are curated for selective data types. Publications that are not fully curated when they initially enter a database may later be partially curated during projects focusing on specific topics. For example, gene functions were curated from all the publications associated with genes involved in kidney development [6]. In such cases, publications may have functional data curated, but no other data types.
Delayed data access also contributes to curated data sets being incomplete. There is significant variation in how soon the full text of a publication may be available. Some journals have embargo periods which restrict publication access to those with personal or institutional subscriptions. Delayed access to the full text of publications slows data entry into data repositories, such as ZFIN, which require the full text to curate. ZFIN currently obtains full text for approximately 50%, 80%, and 90% of the zebrafish literature within 6 months, 1 year, and 3 years of publication respectively.
Incompletely curated data sets also result when new data types are published prior to database resources having the ability to curate those data. Curation of gene expression data commenced at ZFIN in 2005. Curating papers from earlier years, known as “back curation”, is something that many curation teams do not have resources to support. Gene expression data published earlier than 2005 may only be curated at ZFIN if they were brought forward as part of an ongoing project or topic-focused curation effort in subsequent years.
Why is it important to know if a data set includes all relevant published data? This knowledge can help database resources focus expert curation effort where it is needed. Likewise, if a researcher is aware that a data set may be missing records, they may look further for additional relevant published data to complete the data set. Having knowledge of all the published data helps to avoid wasted time, money, and effort repeating work already done by others, and also helps to avoid flawed hypothesis generation based on incomplete data.
In recent years, natural language processing (NLP) and machine learning methods have been widely used in the field of genetics and genomics on tasks such as prediction of intron/exon structure, protein binding sites, gene expression, gene interactions, and gene function [8]. In addition, model organism databases have used NLP and machine learning methods for over a decade to manage and automate processing of the increasing volume of publications that must be identified, prioritized, indexed, and curated [9, 10, 11, 12, 13]. These methods are applied to the incoming literature stream, prior to curation. Machine learning methods can also have utility after curation in maintaining the quality and completeness of curated data sets. The aim of this study was to provide a statistical approach to identify curated data sets that may be incomplete relative to what has been published. The ZFIN gene expression data set was the use case for this study. Researchers and data management teams alike can use the output of this method to guide resource allocation, decision making, and interpretation of data sets with insight into whether additional data may be available to augment an expertly curated data set.
ZFIN gene expression annotations
Table 1 Variables selected for model training and their predictive value score. Variables listed: journal publications with gene expression data; percent of journal publications with expression data; total journal publication count.
Table 2 Results of model testing. Metrics reported: mean absolute error; root mean squared error; relative absolute error; relative squared error; coefficient of determination.
The purpose for making this model was to locate genes in the ZFIN database that have incompletely curated gene expression data sets relative to what has been published. The RMSE of the model output when run on the test data set was 3.368747 (Table 2).
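The five metrics named for model testing can be illustrated with a minimal sketch. It uses the standard textbook definitions, which are assumed (not confirmed by the text) to match what the evaluation module reports. The function name and the sample counts are hypothetical and are not data from this study.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, RAE, RSE and R^2 for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_err = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Baseline errors of a trivial model that always predicts the mean of the actual values
    abs_base = sum(abs(a - mean_actual) for a in actual)
    sq_base = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_err) / sq_base
    return {
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        "rae": sum(abs_err) / abs_base,
        "rse": rse,
        "r2": 1.0 - rse,
    }


# Hypothetical expression experiment counts for five genes (illustration only)
actual = [2, 4, 6, 8, 10]
predicted = [3, 4, 5, 8, 12]
metrics = regression_metrics(actual, predicted)
```

Under these definitions the relative errors compare the model against a baseline that always predicts the mean, so values well below 1 indicate that the model improves on that baseline.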
Expertly curated biological database resources contain highly accurate data [15]. Sometimes accuracy comes at the expense of comprehensiveness because of prioritization of resource utilization, delayed data access, or publication of data before a knowledge base is able to store it. Efficient methods for identifying areas where data have been published but not yet curated are important for curators of data resources and users of those data resources alike. In this manuscript, the ZFIN gene expression data set was used as a test case to develop such a method. This method should be broadly applicable to any data set of sufficient size, as long as the proper predictive features can be identified. In the case of the ZFIN gene expression data, which has been captured from published literature by expert curators since 2005, the number of journal publications associated with a gene was an extremely good predictor of how many gene expression experiments a gene should have. This resulted in a simple linear model comprising five variables. When the model was initially tested, genes associated with transgenic constructs were being reported with high significance as missing gene expression data, when in fact they were not. In some cases, genes associated with transgenic constructs had many dozens of publications associated with them which had no gene expression data for that gene. For example, the promoter of the gene may have been used in the construct, as is the case for the hsp70l gene. If that transgenic line was widely published, many publications ended up being associated with the gene because of the construct, even if there were no gene expression data for that gene in those publications. This led to the identification of the number of transgenic constructs per gene as an important variable in the model for those specific genes that were associated with constructs.
At ZFIN, every incoming zebrafish publication is associated with the genes discussed, even though not all the publications are fully curated. Hence, in ZFIN, the complete available literature across all genes is well represented, and thus the volume of published literature about a gene has a positive correlation with the amount of published data which exists for a gene. However, not all biological knowledge bases gather data using the same strategies. The method described here may not work as well for data sets that have more heterogeneous representation of the published literature or of other key variables. For example, a database which is populated with data by searching the literature for information about a specific record (gene, protein, etc.) may have deep representation of existing literature on the subset of records which have been researched and shallow representation of existing literature on other records. Heterogeneity of literature coverage of this type would detract from the predictive value of pure literature counts as were used for the ZFIN example. In such cases, other types of predictive variables would need to be identified through data exploration and feature engineering. These may include things such as the number of days since the last record update, number of data types associated, the presence of publications in specific journals, and presence of other potentially correlated data types. In some cases, it may be helpful to bring in additional data from external sources that can be linked to the data being examined. For example, UniProt records may not be associated with the complete literature about a protein or the associated gene. UniProt data for zebrafish proteins could be combined with the literature set from ZFIN for each related gene. This may increase the predictive value of the count of publications for identifying protein records in UniProt that are missing a piece of data of interest.
Creative variable engineering will always be a critical step in successful application of this method.
The method described here produces a binary classification of genes that are predicted to be or not to be missing expression data based on the residual values being inside or outside the 95% CI of the model. A binary classification model makes sense for this problem. Unlike a binary classification, regression models result in a real number prediction of the label, in this case the number of gene expression experiments per gene. The regression model has the added possibility of providing a quantitative metric whose magnitude may correlate with the level of incompleteness of the data set. Confirmation of that possibility will require significant effort which should be the subject of future work.
This method can provide curators with a list of genes having published gene expression data that is yet to be curated. Therefore, the high precision outcome is important as it ensures that curators spend time reviewing publications for genes that are missing data. The model resulted in a recall/sensitivity of 0.71 and 0.73 at the lower and upper 95% CI, meaning 71% and 73% of the genes that were confirmed to be missing gene expression data were identified. From the perspective of a data curator, modest recall is acceptable for this method because subsequent rounds of model training and testing could be executed to iteratively refine and complete the data set. Genes that were not identified as missing data in the initial round of training and testing would eventually be identified in subsequent cycles of training, testing, and data updating. From the perspective of a data consumer, it would be beneficial to correctly identify as many genes as possible which have incomplete gene expression data sets. If future work finds that the magnitude of the residuals correlates well with the amount of missing expression data, then the residual itself could be provided to downstream data consumers as a metric of data set completeness for every gene included in the test set.
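How precision and recall follow from manually validated counts can be shown with a short sketch. The counts below are hypothetical and are not the validation counts from this study; the function name is likewise an illustrative assumption.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from confusion-matrix counts."""
    flagged = true_pos + false_pos          # genes the model flagged as missing data
    actually_missing = true_pos + false_neg  # genes confirmed to be missing data
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actually_missing if actually_missing else 0.0
    return precision, recall


# Hypothetical counts: 45 flagged genes confirmed missing data, 5 flagged in error,
# and 15 genes missing data that the model did not flag.
precision, recall = precision_recall(true_pos=45, false_pos=5, false_neg=15)
```

High precision means little curator time is spent on genes that turn out to be complete, while recall measures how many of the truly incomplete genes were found.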
Machine learning methods are having significant impact upon many areas of our experience as scientists. As the field of data science has matured, these methods have become powerful tools for analysis, interpretation, and utilization of the increasingly large and interrelated data sets available today including numeric, free text, and image data. This work provides a machine learning approach to monitor data set completeness. It is concluded that this method could be used to identify incomplete data sets of any type curated from published literature, assuming proper predictive variables can be identified to build an accurate model.
Three data files were combined to build the predictive model. All three are provided as supplementary files to this manuscript. The MachineLearningReport.txt file (Additional file 2) is a custom report consisting of one row per gene in the ZFIN database, generated on Nov. 29, 2016. Data columns included the ZDB-GENE ID, gene symbol, gene name, count of gene expression experiments, count of journal publications attributed for gene expression annotations, count of Gene Ontology annotations, and count of journal publications attributed for Gene Ontology annotations. The columns related to the Gene Ontology had no value for predicting the number of gene expression experiments, so they were excluded from further analysis.
The GenePublication.txt (Additional file 3) and ConstructComponents.txt (Additional file 4) files are generated daily at ZFIN and made available via the ZFIN downloads page (https://zfin.org/downloads). The GenePublication.txt file was obtained on Nov. 30, 2016. The columns were gene symbol, ZDB-GENE ID, ZDB-PUB ID, publication type, and PubMed ID when available. The ConstructComponents.txt file was obtained on Dec. 19, 2016 and included columns for the ZFIN construct ID, construct name, construct type, related gene ZDB-GENE ID, related gene symbol, related gene type, a relationship between the gene and the construct, and two ontology term IDs from the sequence ontology to specify the type of construct and the type of related marker. For this study, the only data used was a count of constructs related to each gene, which was computed from the ConstructComponents.txt file.
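A count of constructs per gene could be derived from a tab-delimited file along the following lines. This is a sketch under assumptions: the column positions follow the column order described above, distinct constructs are counted once per gene even when they appear on several rows, and the rows shown are invented placeholders rather than ZFIN records.

```python
import csv
import io
from collections import defaultdict


def constructs_per_gene(tsv_text, construct_col=0, gene_col=3):
    """Count distinct constructs related to each gene ID in tab-delimited text."""
    related = defaultdict(set)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(construct_col, gene_col):
            continue  # skip blank or truncated lines
        related[row[gene_col]].add(row[construct_col])
    return {gene_id: len(ids) for gene_id, ids in related.items()}


# Invented rows for illustration only; the IDs and names are placeholders.
rows = [
    ["CONSTRUCT-1", "Tg(geneA:GFP)", "Transgenic Construct", "GENE-A", "geneA", "gene", "promoter of"],
    ["CONSTRUCT-1", "Tg(geneA:GFP)", "Transgenic Construct", "GENE-B", "geneB", "gene", "coding sequence of"],
    ["CONSTRUCT-2", "Tg(geneA:geneA)", "Transgenic Construct", "GENE-A", "geneA", "gene", "promoter of"],
    ["CONSTRUCT-2", "Tg(geneA:geneA)", "Transgenic Construct", "GENE-A", "geneA", "gene", "coding sequence of"],
]
sample_text = "\n".join("\t".join(r) for r in rows)
construct_counts = constructs_per_gene(sample_text)
```

Using a set per gene avoids double counting a construct that is related to the same gene through more than one relationship.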
Data preparation and modeling
Manipulations of input data files, feature selection and engineering, model building, training, evaluation, model selection, and final model scoring were all done using modules provided in Microsoft Azure Machine Learning Studio (https://studio.azureml.net) using a free workspace level account. Features per gene used to train and test the linear regression model included the gene symbol, the number of journal publications attributed for gene expression, the number of gene expression experiments (the label), total number of journal publications, the percentage of journal publications with curated expression data, and the number of transgenic constructs associated with each gene.
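The per-gene feature row described above can be sketched as follows. The dictionary keys and the function name are hypothetical, and expressing the percentage on a 0 to 100 scale is an assumption; the text does not state the scale used.

```python
def build_features(symbol, total_pubs, expression_pubs, construct_count, experiments):
    """Assemble one gene's feature row; 'experiments' is the label to predict."""
    # Assumption: percentage on a 0-100 scale, 0 when a gene has no publications
    percent = 100.0 * expression_pubs / total_pubs if total_pubs else 0.0
    return {
        "symbol": symbol,
        "total_pubs": total_pubs,
        "expression_pubs": expression_pubs,
        "percent_expression_pubs": percent,
        "construct_count": construct_count,
        "experiments": experiments,
    }


# Hypothetical gene: 8 journal publications, 2 of them attributed for expression data
features = build_features("geneA", total_pubs=8, expression_pubs=2,
                          construct_count=1, experiments=5)
```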
The set of all gene records in the ZFIN database (36,655 genes as of Nov. 29, 2016) was filtered to exclude genes that were unlikely to be useful in this analysis, including withdrawn genes, microRNA genes, genes with a colon in the name (typically not yet studied), genes with symbols starting with “unm_” (typically not yet studied), and genes with no associated journal publications as determined by data from the GenePublication.txt file. Genes with more than 200 existing expression experiments were also excluded because they are already heavily annotated for gene expression, many were found to be anatomical marker genes of less interest for the purposes of this work (e.g., egr2b), and their heavy annotation may give them undesirable leverage that could negatively affect model performance for genes of interest which may have few annotations. Those excluded genes having more than 200 expression experiments have red symbols in Fig. 3. The resulting gene set used as input for model training and testing included 9870 genes. Any null numeric values generated in the data during file joining were set to 0 using the Azure Machine Learning Clean Missing Data module, and no duplicate rows were present. A stratified split keyed on the expression experiment count was used in the Split Data module to select 25% of the genes (2483 genes) for training the model and 75% (7387 genes) for scoring the model. The Linear Regression, Train Model, and Score Model modules were used to train and score the model. The Linear Regression module used the following parameters: Solution method: ordinary least squares; L2 regularization weight: 10; Include intercept term: unchecked; Allow unknown categorical levels: checked; Random number seed: 112. Model performance was assessed using the Azure Machine Learning Evaluate Model module. The trained model was used to predict the number of expression experiments for the 7387 genes that were not used in model training.
The resulting prediction was appended as a new column to the input data set.
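The exclusion rules used to reduce the gene set can be expressed as a simple predicate. This sketch assumes each gene is a dictionary with hypothetical keys; in particular, how withdrawn and microRNA genes are flagged in the source files is not described in the text, so boolean fields are used here as stand-ins.

```python
MAX_EXPERIMENTS = 200  # genes with more existing expression experiments are excluded


def keep_gene(gene):
    """Return True if a gene record passes all exclusion rules."""
    if gene["is_withdrawn"] or gene["is_microrna"]:
        return False
    if ":" in gene["name"]:
        return False  # colon in the name, typically not yet studied
    if gene["symbol"].startswith("unm_"):
        return False  # typically not yet studied
    if gene["journal_pubs"] == 0:
        return False  # no associated journal publications
    if gene["expression_experiments"] > MAX_EXPERIMENTS:
        return False  # already heavily annotated
    return True


def make_gene(symbol, name, pubs, experiments, withdrawn=False, microrna=False):
    """Build a hypothetical gene record for the example below."""
    return {"symbol": symbol, "name": name, "journal_pubs": pubs,
            "expression_experiments": experiments,
            "is_withdrawn": withdrawn, "is_microrna": microrna}


# Invented records for illustration only
genes = [
    make_gene("geneA", "gene A", pubs=12, experiments=7),
    make_gene("geneB", "zgc:000000", pubs=3, experiments=1),
    make_gene("unm_x1", "unnamed x1", pubs=2, experiments=0),
    make_gene("geneC", "gene C", pubs=0, experiments=0),
    make_gene("geneD", "gene D", pubs=400, experiments=350),
    make_gene("geneE", "gene E", pubs=5, experiments=2, withdrawn=True),
]
kept_symbols = [g["symbol"] for g in genes if keep_gene(g)]
```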
Analysis and data visualizations
Model results, including the input data plus the predicted number of expression experiments, for the 7387 genes were exported from Azure Machine Learning Studio as a tab delimited file and imported into Microsoft Excel for Mac v16.27 for data validation and analyses (Additional file 5). Residuals were calculated as the actual expression experiment count minus the number of expression experiments predicted by the model. The 95% confidence interval of the model, computed as 2 times the root mean squared error (RMSE), was used to establish significance of the residuals. Genes with residuals outside or inside the 95% confidence interval were then considered as being predicted to be missing or not missing expression annotation respectively. One hundred genes inside and outside the negative 95% CI were randomly selected for manual testing by sorting the genes in the Excel spreadsheet based on a randomly generated number column and copying the first genes from each set into a new Excel sheet. To remain blinded during the evaluation step, those genes were randomized again as a set by sorting based on a randomly generated number column. That gene selection process was repeated for 50 genes inside and outside the positive 95% CI. A manual evaluation was then done for each journal publication not already curated for expression data that was associated with each of the selected genes. The publications for each gene were sorted oldest to newest based on publication date and were then evaluated in order, starting with the oldest publications. Publication assessment for each gene continued until either all the publications were examined for a gene or a publication with missing expression data for that gene was identified, whichever came first. The result was recorded along with the assessment date and the ZDB-PUB ID for the publication that was missing the expression data, if one was found.
The results of this data validation were used to produce a confusion matrix describing model precision and recall around the upper or lower 95% CI.
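The residual test described in this section can be sketched in a few lines. The RMSE value is the test-set figure reported in the text; the function name, the labels, and the example genes are hypothetical.

```python
RMSE = 3.368747          # test-set RMSE reported in the text
HALF_WIDTH = 2 * RMSE    # interval half-width, computed as 2 times the RMSE


def classify_residual(actual, predicted, half_width=HALF_WIDTH):
    """Return (residual, label), with residual = actual count minus predicted count."""
    residual = actual - predicted
    if residual < -half_width:
        return residual, "outside lower bound"
    if residual > half_width:
        return residual, "outside upper bound"
    return residual, "inside"


# Hypothetical (actual, predicted) expression experiment counts for three genes
examples = {"geneA": (1, 10.0), "geneB": (20, 10.0), "geneC": (5, 7.5)}
labels = {gene: classify_residual(a, p)[1] for gene, (a, p) in examples.items()}
```

Genes labeled as outside either bound would be the ones treated as predicted to be missing expression annotation.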
Publication records in ZFIN each have a unique ZDB-PUB ID, for example ZDB-PUB-161203-17. The first six digits indicate the date the record was created in YYMMDD format. Those data were parsed out of the list of IDs for publications that were recorded as containing uncurated gene expression data. The year component was then used to group those data to get a count of the number of genes per year that were found to have uncurated gene expression data. Even though it was the year of publication entry into ZFIN that was being counted, only the first paper encountered with uncurated expression data was recorded per gene, so the count is equal to the number of genes in the sample having uncurated expression data from each year.
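Extracting the entry year from a ZDB-PUB ID and counting per year might look like the sketch below. The two-digit year has to be expanded to four digits; the pivot of 50 used here is an assumption, not something stated in the text. Apart from ZDB-PUB-161203-17, which appears above, the IDs are invented.

```python
import re
from collections import Counter

# ZDB-PUB-YYMMDD-N: the six digits encode the record creation date
PUB_ID_PATTERN = re.compile(r"^ZDB-PUB-(\d{2})(\d{2})(\d{2})-\d+$")


def entry_year(pub_id, pivot=50):
    """Return the four-digit year a publication record was created."""
    match = PUB_ID_PATTERN.match(pub_id)
    if match is None:
        raise ValueError(f"not a ZDB-PUB ID: {pub_id!r}")
    two_digit_year = int(match.group(1))
    # Assumption: two-digit years at or above the pivot belong to the 1900s
    century = 1900 if two_digit_year >= pivot else 2000
    return century + two_digit_year


# One recorded publication per gene; the last two IDs are invented for illustration
recorded_ids = ["ZDB-PUB-161203-17", "ZDB-PUB-990507-3", "ZDB-PUB-160115-8"]
genes_per_year = Counter(entry_year(pub_id) for pub_id in recorded_ids)
```

Because only the first publication with uncurated expression data is recorded per gene, each ID contributes exactly one gene to its year's count.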
Data visualizations were created using both Excel and Tableau Desktop Professional Edition v10.1.4.
I would like to thank Monte Westerfield for his support of this work, Sierra Taylor Moxon for generating the MachineLearningReport.txt file, and the rest of the ZFIN team for their consistent dedication to producing and maintaining a high-quality resource at ZFIN.
This work was supported by grant U41 HG002659 from the National Human Genome Research Institute of the National Institutes of Health (https://www.nih.gov). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
The datasets supporting the conclusions of this article are included within the article and its additional files.
DH created the study design, and conducted the data collection, analysis, validation, and writing of the manuscript. The author read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The author declares that he has no competing interests.
2. Shkurin A, Vellido A. Using random forests for assistance in the curation of G-protein coupled receptor databases. Biomed Eng Online. 2017;16:75.
4. Marchese Robinson RL, Lynch I, Peijnenburg W, Rumble J, Klaessig F, Marquardt C, et al. How should the completeness and quality of curated nanomaterial data be evaluated? Nanoscale. 2016;8:9919–43.
5. Howe DG, Bradford YM, Eagle A, Fashena D, Frazer K, Kalita P, et al. The Zebrafish Model Organism Database: new support for human disease models, mutation details, gene expression phenotypes and searching. Nucleic Acids Res. 2017;45:D758–68. http://www.ncbi.nlm.nih.gov/pubmed/27899582.
6. Alam-Faruque Y, Hill DP, Dimmer EC, Harris MA, Foulger RE, Tweedie S, et al. Representing kidney development using the gene ontology. PLoS One. 2014;9:e99864. https://www.ncbi.nlm.nih.gov/pubmed/24941002.
7. Ruzicka L, Bradford YM, Frazer K, Howe DG, Paddock H, Ramachandran S, et al. ZFIN, The zebrafish model organism database: Updates and new directions. Genesis. 2015;53:498–509. http://www.ncbi.nlm.nih.gov/pubmed/26097180.
8. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet. 2015;16:321–332. http://www.ncbi.nlm.nih.gov/pubmed/25948244.
9. Müller H-M, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2004;2:e309. http://www.ncbi.nlm.nih.gov/pubmed/15383839.
10. Chen D, Müller H-M, Sternberg PW. Automatic document classification of biological literature. BMC Bioinformatics. 2006;7:370. http://www.ncbi.nlm.nih.gov/pubmed/16893465.
11. Van Auken K, Fey P, Berardini TZ, Dodson R, Cooper L, Li D, et al. Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford). 2012;2012:bas040. http://www.ncbi.nlm.nih.gov/pubmed/23160413.
12. Fang R, Schindelman G, Van Auken K, Fernandes J, Chen W, Wang X, et al. Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 2012;13:16. http://www.ncbi.nlm.nih.gov/pubmed/22280404.
13. Jiang X, Ringwald M, Blake J, Shatkay H. Effective biomedical document classification for identifying publications relevant to the mouse gene expression database (GXD). Database (Oxford). 2017;2017. http://www.ncbi.nlm.nih.gov/pubmed/28365740.
14. Adám A, Bártfai R, Lele Z, Krone PH, Orbán L. Heat-inducible expression of a reporter gene detected by transient assay in zebrafish. Exp Cell Res. 2000;256:282–290. http://www.ncbi.nlm.nih.gov/pubmed/10739675.
15. Keseler IM, Skrzypek M, Weerasinghe D, Chen AY, Fulcher C, Li G-W, et al. Curation accuracy of model organism databases. Database (Oxford). 2014;2014. http://www.ncbi.nlm.nih.gov/pubmed/24923819.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.