Taxonomy based performance metrics for evaluating taxonomic assignment methods
Metagenomics experiments often make inferences about microbial communities by sequencing 16S and 18S rRNA, and taxonomic assignment is a fundamental step in such studies. This paper addresses the weaknesses in two types of metrics commonly used by previous studies for measuring the performance of existing taxonomic assignment methods: Sequence count based metrics and Binary error measurement. These metrics made performance evaluation results biased, less informative and mutually incomparable.
We investigated weaknesses in two types of metrics and proposed new performance metrics including Average Taxonomy Distance (ATD) and ATD_by_Taxa, together with the visualized ATD plot.
By comparing the evaluation results from four popular taxonomic assignment methods across three test data sets, we found the new metrics more robust, informative and comparable.
Keywords: Metagenomics, Classification, Performance evaluation, Data analysis
Abbreviations
ATD: Average Taxonomy Distance
RDPNBC: RDP Naive Bayesian Classifier
Taxonomic assignment using 16S and 18S rRNA gene classification
A fundamental step in microbiota studies is taxonomic assignment, in which each sequence or “read” in the study sample is assigned a taxonomic label. The most common approach is to sequence the 16S and 18S rRNA genes as biomarkers and classify the resulting sequences. Several methods exist for this classification, including the RDP Naive Bayesian Classifier (hereafter RDPNBC), K-Nearest Neighbor, SINTAX, TACOA, Taxator-tk, Kraken and 16S Classifier. Method performance is typically (cross-)validated on popular databases, and the methods have been characterized as having different strengths. Vinje et al. compared several k-mer based taxonomic assignment methods and found that they approach an error plateau.
Challenges in performance evaluation
Taxonomy Choice: Classification results using different taxonomic databases cannot be directly compared. Since different sets of reference sequences and nomenclatures (e.g., Bergey’s, NCBI) are used, they might give the same taxonomic assignment for different query sequences or vice versa. In addition, taxonomic names are changed (or updated) as new microorganisms are identified, which makes the results even less consistent.
Testing Data: Data from communities differ from context to context (human gut, soil, etc.), and there are currently no standard testing data for each context. Previous studies derived their evaluation results by performing cross-validation on existing 16S and 18S rRNA databases such as RDP, Greengenes and SILVA. The Critical Assessment of Metagenome Interpretation (CAMI) open-access platform also provides specially generated data sets for benchmarking.
Reference database coverage: Microbial marker genes such as 16S and 18S rRNA correspond to only a small fraction of species’ taxonomic names and known sequences. Taxonomic assignment methods cannot learn the patterns of unseen taxa, regardless of their performance.
Performance Metrics: After cross-validating on databases, one may summarize the test results with performance metrics such as accuracy, precision or recall. The choice of metrics reflects a particular viewpoint on the task and heavily influences how researchers interpret the evaluation results. We believe that good taxonomic assignment performance metrics should support inferences on absolute performance given the known reference sequences, as well as comparisons among different methods. Performance metrics are also a separate issue from the first three challenges, which are more closely tied to the reference data sets. Performance metrics will always be the final, direct point of reference for the performance of taxonomic assignment methods.
Two weaknesses of performance metrics in previous studies
Previous studies showed that most taxonomic assignment algorithms could achieve around 90% accuracy when genus is chosen as the classification target rank. High accuracy, however, does not necessarily imply high performance. Here, we illustrate two weaknesses of the performance metrics used by previous studies: Sequence Count Based Metrics and Binary Error Measurement.
Sequence count based metrics
Biased Performance Evaluation
With sequence count based metrics, one may assume that the taxa distributions in databases are similar to those in samples, but this is usually not true in practical microbiota research. With regard to imbalanced data sets, the sequence count based metrics are just measuring how well a method performs based on a few specific taxa with high sequence frequency in a database, not its ability to recognize every taxon.
Incomparable Evaluation Results
To address the problem of frequent taxa, some previous studies resorted to “pruning” (undersampling) large taxa in databases to even out the sequence counts for each taxon [7, 8]. This strategy alleviates the imbalances in databases at the cost of sequence diversity in the pruned taxa, making the database coverage even poorer. Moreover, different studies used different undersampling methods, which makes experimental results mutually incomparable between studies. Vague descriptions of how the pruning was done also make the experiments less repeatable and reproducible.
Replacing the “taxa distributions in databases and samples are similar” assumption, we normalize taxa distributions by weighting each taxon equally in performance metrics to reflect a classification method’s recognition capabilities. We aim to give equal treatment to the prediction results of each taxon while avoiding resampling, which tends to make questionable adjustments to the original databases. In contrast to sequence count based metrics, this approach can be considered as “taxon count based metrics”. This concept has also appeared in some recent work [4, 5, 13].
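The difference between the two weighting schemes can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and are not taken from the study; the sketch uses plain binary error so that only the weighting changes.

```python
from collections import defaultdict

def error_rate_by_seq(pairs):
    """Sequence count based: every prediction carries the same weight."""
    return sum(true != pred for true, pred in pairs) / len(pairs)

def error_rate_by_taxa(pairs):
    """Taxon count based: compute an error rate per true taxon, then average."""
    errors_by_taxon = defaultdict(list)
    for true, pred in pairs:
        errors_by_taxon[true].append(true != pred)
    rates = [sum(errs) / len(errs) for errs in errors_by_taxon.values()]
    return sum(rates) / len(rates)

# Hypothetical imbalanced toy data: (true taxon, predicted taxon).
# genusA dominates the database; genusB and genusC have one sequence each.
pairs = [("genusA", "genusA")] * 8 + [("genusB", "genusA"), ("genusC", "genusA")]

print(error_rate_by_seq(pairs))   # driven by the majority taxon
print(error_rate_by_taxa(pairs))  # each of the three taxa counts equally
```

In this toy case a classifier that always answers "genusA" looks good by sequence count (2 errors in 10 predictions) but fails on two of the three taxa, which is what the taxon count based figure reports.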
Binary error measurement
Loss of Information
The assumption behind binary error measurement is that all taxa (taxonomic labels) are equally different from one another. Such an assumption does not fit the nature of tree-based taxonomy, where taxonomic assignment is viewed as a hierarchical classification (HC) problem. A hierarchical performance measure should use the class hierarchy to properly evaluate HC algorithms.
Most previous studies made independent binary evaluations at each rank, in which performances were measured separately with different taxonomic ranks as classification targets [2, 4, 5, 13]. This design does not fully deploy the concept of HC, leading to loss of information as explained below.
When a high rank is set as the classification target, the evaluation result loses the information about whether a method is capable of differentiating the taxa at lower ranks. When a low rank is set as the classification target, however, we face the issue of singletons (taxa represented by a single sequence in the database). Since singletons cannot be correctly classified, some previous studies discarded these predictions from their statistics, which made the results overly optimistic.
Nevertheless, no taxon is completely novel in a taxonomic tree. Therefore, a method could still make generalized predictions on singletons. Discarding or ignoring them actually leads to the data diversities shrinking further and losing information on the performance on these variations.
Incomparable Evaluation Results
Previous studies viewed singletons as unavoidable (equal-degree) errors and treated these sequences in various ways. Binary error therefore not only caused loss of information, but also raised the otherwise unnecessary question of how to treat singletons, making the evaluation results incomparable.
We found that inconsistencies also existed within the same studies. For example, Fig. 1 in Wang et al.’s study suggested that one would lose merely 3% accuracy when changing the target rank from family to genus, but the evaluation results were actually based on different data sets (i.e., with different sets of records recognized as singletons).
Combining the two solutions described above (taxon count based metrics and a hierarchy-aware error measure), we propose a new set of performance metrics, together with a visualized plot, and reevaluate the performance of a few taxonomic assignment methods on three databases.
Taxonomy based performance metrics
Per-prediction error: Taxonomy Distance
For a given query sequence, a taxonomic assignment method gives a taxonomic label as a prediction. The Taxonomy Distance (TD) of a prediction measures the rank-level disagreement between the predicted label and the actual label.
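The formal definition of TD is not reproduced in this text. The sketch below assumes a form inferred from the worked examples given elsewhere in the paper: the number of ranks not shared from the root, divided by the number of ranks in the longer of the two labels. The function names and the semicolon-separated label format are illustrative assumptions, not the authors' implementation.

```python
from fractions import Fraction

def parse_label(label):
    """Split 'orderA;familyB;genusC' into a tuple of ranks, ignoring empty parts."""
    return tuple(part for part in label.split(";") if part)

def taxonomy_distance(actual, predicted):
    """Assumed TD: (longer label length - shared leading ranks) / longer label length."""
    a, p = parse_label(actual), parse_label(predicted)
    longest = max(len(a), len(p))
    if longest == 0:
        raise ValueError("at least one label must contain a rank")
    shared = 0
    for rank_a, rank_p in zip(a, p):
        if rank_a != rank_p:
            break
        shared += 1
    return Fraction(longest - shared, longest)

actual = "orderA;familyB;genusE"
print(taxonomy_distance(actual, "orderA;familyB;genusC;speciesD"))  # 1/2, i.e. 2/4
print(taxonomy_distance(actual, "orderA;familyB;"))                 # 1/3
print(taxonomy_distance(actual, "orderA;familyB;genusC"))           # 1/3
```

Under this assumed form a correct prediction has TD 0, and a single wrong rank in a six-rank label gives 1/6, which matches the "0.16 TD (1-rank error)" reported in the results.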
Per-taxon error: Average Taxonomy Distance
Overall performance: ATD_by_Taxa
Error Rate (by taxa) and ATD (by seq)
We also derived two metrics to compare the effects of our two solutions: Err_by_taxa and ATD_by_seq, which use only one of the two solutions. Error rate (by taxa) is a taxon count based metric that uses binary error for each prediction, in which error rates are calculated for each taxon and then averaged. ATD (by seq) is a metric using Taxonomy Distance, but with no reference to taxon count; it is simply the mean of TDs among the predictions.
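The four summary metrics discussed in this section can be sketched side by side. This is a minimal illustration: `td` uses an assumed TD form (shared leading ranks relative to the longer label), labels are tuples of rank names, and the toy records are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def td(actual, predicted):
    """Assumed TD: 1 - shared leading ranks / length of the longer label."""
    shared = 0
    for a, p in zip(actual, predicted):
        if a != p:
            break
        shared += 1
    longest = max(len(actual), len(predicted))
    return Fraction(longest - shared, longest)

def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)

def summarize(records):
    """records: list of (actual label, predicted label) pairs."""
    tds_by_taxon = defaultdict(list)  # grouped by the TRUE taxon
    for actual, predicted in records:
        tds_by_taxon[actual].append(td(actual, predicted))
    all_tds = [d for tds in tds_by_taxon.values() for d in tds]
    return {
        # binary error, every sequence weighted equally
        "Err_by_seq": mean(int(d > 0) for d in all_tds),
        # binary error, every taxon weighted equally
        "Err_by_taxa": mean(mean(int(d > 0) for d in tds) for tds in tds_by_taxon.values()),
        # Taxonomy Distance, every sequence weighted equally
        "ATD_by_seq": mean(all_tds),
        # Taxonomy Distance, every taxon weighted equally (mean of per-taxon ATDs)
        "ATD_by_taxa": mean(mean(tds) for tds in tds_by_taxon.values()),
    }

# Hypothetical toy data: one majority taxon and two single-sequence taxa.
g1, g2, g3 = ("o1", "f1", "g1"), ("o1", "f1", "g2"), ("o1", "f2", "g3")
records = [(g1, g1)] * 8 + [(g2, g1), (g3, g1)]
metrics = summarize(records)
for name, value in metrics.items():
    print(name, value)
```

On this toy input the by-sequence figures are small because the majority taxon dominates, while the by-taxa figures expose the two taxa that are never recognized; the ATD variants additionally distinguish the one-rank error on `g2` from the two-rank error on `g3`.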
Visualizing through an ATD plot
Data and taxonomic assignment methods
10-fold cross-validation and macro average
We used stratified 10-fold cross-validation for this study to reduce the outcome variance and bias across the folds. In keeping with the uniform taxa distribution assumption, we performed macro-averaging rather than micro-averaging when summarizing ATDs for each taxon. That is, rather than calculating performance metrics for each data fold and averaging them, we first aggregated all the TDs from the data folds and then calculated the ATD for each taxon.
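The aggregation order described above can be sketched as follows. The fold contents are hypothetical, and each record is simply a (true taxon, TD) pair. The sketch contrasts pooling all TDs before computing per-taxon ATDs with the alternative of computing the metric per fold and averaging.

```python
from collections import defaultdict
from fractions import Fraction

def atd_by_taxa(records):
    """Mean of per-taxon mean TDs for an iterable of (true taxon, TD) records."""
    tds_by_taxon = defaultdict(list)
    for taxon, distance in records:
        tds_by_taxon[taxon].append(distance)
    atds = [sum(tds, Fraction(0)) / len(tds) for tds in tds_by_taxon.values()]
    return sum(atds, Fraction(0)) / len(atds)

# Hypothetical test results from two folds; taxon "C" appears in only one fold.
folds = [
    [("A", Fraction(0)), ("A", Fraction(0)), ("B", Fraction(1, 3))],
    [("A", Fraction(0)), ("B", Fraction(0)), ("C", Fraction(2, 3))],
]

# Pool the TDs from every fold first, then compute per-taxon ATDs.
pooled = atd_by_taxa(record for fold in folds for record in fold)

# Alternative: compute the metric within each fold, then average over folds.
per_fold = sum(atd_by_taxa(fold) for fold in folds) / len(folds)

print(pooled, per_fold)
```

The two orders disagree whenever a taxon is absent from some folds, as small taxa often are; pooling first gives every taxon exactly one vote in the final average.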
Summary for the rRNA gene databases used for this study
This study used the full length 16S and 18S rRNA gene sequences throughout the training and testing processes, and no singletons or other sequences were discarded from the databases so as to keep results comparable and maintain sequence variation.
Taxonomic assignment methods
Settings for the chosen taxonomic assignment methods: numwanted = 1; cutoff = 0; cutoff = 0
The effects of taxon count based metrics and taxonomy distance
The top left plot shows that most sequences in the database were predicted correctly and that around one-tenth of the sequences were misclassified. Of the 12% error rate, 8% came from singletons and 4% from non-singletons, which is consistent with the evaluation results from previous studies.
The top right plot shows the effect of switching from sequence count based metrics to taxon count based metrics. The overall error rate weighted by taxon count was 50%, showing that, though RDPNBC could correctly classify 88% of sequences in the database in cross-validation, those correct predictions only represent the capability of classifying half of the taxa. Here we see sequence count based metrics were biased toward the performance on majority taxa and failed to represent recognition capabilities.
The bottom left plot shows the effect of switching from binary error measurement to Taxonomy Distance. 88% of sequences had 0 TD corresponding to the 0-error sequences in the top left plot. Most of the 12% of sequences with errors were actually 0.16 TD (1-rank error). Here we see that Taxonomy Distance provides more detailed information on incorrect predictions and that singletons are not unavoidable errors.
The bottom right plot, the final ATD plot, shows the ATDs across the taxa. We again see that most of the 0 TDs in the bottom left plot came from majority taxa and that, although RDPNBC was perfectly correct on only half of the taxa in the database, most of the errors in the remaining taxa were 1-rank errors. The overall performance, ATD_by_taxa, was 0.11, which corresponds to an expected error of about half a rank per prediction. Together, taxon count based metrics and Taxonomy Distance gave more robust and informative evaluation results.
Method performance and best performance
There is a difference between “how good the method is” and “how close the method is to perfection”. By comparing the evaluation result to the best achievable performance, we can see how close a classifier is to perfect and identify the difficult and important cases that algorithm designers need to work on. Here we describe how the new metrics serve this purpose better.
When using binary error measurement, the behavior of the ideal (hereafter Plateau) algorithm can be described as: (1) If a taxon T is also present in the training data, predict T. (2) Else, get an error.
Considering Taxonomy Distance, the Plateau algorithm’s behavior can be defined in a finer-grained form: (1) If a taxon T is also present in the training data, predict T. (2) Else, generate a prediction with the minimum TD from the taxonomy labels in the training data.
Note that we use the verb “generate” to indicate that the prediction with the minimum TD is not necessarily a complete taxonomy label from the training data. In some cases, the minimum TD comes from a trimmed taxonomy label. For example, suppose the training data contain a single sequence with the taxonomic label “orderA;familyB;genusC;speciesD”. When a classifier makes a prediction on a sequence whose actual taxonomic label is “orderA;familyB;genusE”, it cannot make an error-free prediction since there is no such taxonomic label in the training data. However, the best prediction with the smallest TD given this training data is not “orderA;familyB;genusC;speciesD”, which would have a TD of 2/4, but the “trimmed” taxonomic label “orderA;familyB;” or “orderA;familyB;genusC”, each with a TD of 1/3.
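The Plateau behavior under Taxonomy Distance can be sketched as a search over all trimmed versions (prefixes) of the training labels. This is an illustrative sketch, not the authors' program: `td` uses an assumed TD form consistent with the example above, labels are tuples of rank names, and `plateau_prediction` is a hypothetical helper name.

```python
from fractions import Fraction

def td(actual, predicted):
    """Assumed TD: 1 - shared leading ranks / length of the longer label."""
    shared = 0
    for a, p in zip(actual, predicted):
        if a != p:
            break
        shared += 1
    longest = max(len(actual), len(predicted))
    return Fraction(longest - shared, longest)

def plateau_prediction(actual, training_labels):
    """Return the trimmed training label with the smallest TD to the actual label.

    Candidates are every non-empty prefix of every training label, so a label
    that is present in the training data is found with TD 0 (rule 1), and an
    unseen label falls back to the closest trimmed label (rule 2).
    """
    candidates = {
        label[:k] for label in training_labels for k in range(1, len(label) + 1)
    }
    # Sorting makes tie-breaking deterministic: shorter prefixes come first.
    return min(sorted(candidates), key=lambda candidate: td(actual, candidate))

training = [("orderA", "familyB", "genusC", "speciesD")]
actual = ("orderA", "familyB", "genusE")
best = plateau_prediction(actual, training)
print(best, td(actual, best))
```

For the example in the text this returns the trimmed label `("orderA", "familyB")` with a TD of 1/3; the equally good `("orderA", "familyB", "genusC")` loses only the tie-break.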
The second row shows the ATD plots for RDPNBC, Plateau and their ATD differences (paired by taxon). RDPNBC was error-free on 1/2 of the taxa, made one-rank errors on 1/3 of the taxa, and reached the Plateau's error on 4/5 of the taxa. For algorithm designers, this result points out not only what could or should be improved, but also how much such improvements may influence overall recognition capabilities. We conclude that ATD and the ATD plot capture both recognition capability and correctness.
(See Supp. for testing other methods on other databases)
Method performance comparison
The results of testing on RDP were consistent with experimental results in previous studies. All metrics gave the same performance ranking, and all methods except KNN were nearly equally good and close to the Plateau.
When testing on Greengenes, Error rate (by seq) and ATD (by seq) showed that KNN significantly outperformed RDPNBC and SINTAX. However, standard deviations for these two metrics suggest that KNN is more prone to having unexpectedly large errors for some predictions than RDPNBC and SINTAX. Here, we see that binary error measurement leads to loss of information. On the other hand, KNN gets a decent 0.165 Error rate (by seq) but a high 0.745 error rate (by taxa). This shows that there is a high imbalance of taxa in Greengenes and sequence count based metrics favor methods that are good at recognizing majority taxa.
ATD_by_Taxa shows a stable performance ranking, “Plateau, 1NN, RDPNBC, SINTAX, KNN”, regardless of the database used. The gap to the Plateau shows that there is still room for improvement.
This study introduces taxon count based metrics and Taxonomy Distance to address the weaknesses in previous metrics. Kosmopoulos et al. characterized existing metrics for evaluating HC algorithms into two classes: pair-based and set-based. Pair-based measures assign costs to pairs of predicted and true classes as the minimum distance in the tree hierarchy. Set-based measures are based on operations on the entire sets of predicted and true classes. The TD mainly uses the concept of set-based calculation.
The UniFrac metric shown in the taxonomic profiling challenge in Sczyrba et al.’s study takes more of a pair-based approach, calculating the minimum distance between the true taxonomic label and the predicted label in a taxonomic tree. Both Taxonomy Distance and UniFrac distance take advantage of the hierarchy in the taxonomic tree. Compared to UniFrac, TD puts more emphasis on higher-rank prediction errors, such as TD(T4, T5) > TD(T1, T2) in Table 2, and less on over-specialization cases. For example, suppose the actual taxonomic label is T4 in Table 2. T6 has 2 more lower ranks in the label than T5. The UniFrac distances for (T4, T5) and (T4, T6) are 2 and 4, respectively, being proportional to edge differences. On the other hand, TD(T4, T5) and TD(T4, T6) would be 1/2 and 3/4, respectively, reflecting rank differences more strongly.
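The contrast between an edge-counting distance and TD can be reproduced with a short sketch. The labels below are hypothetical stand-ins chosen only so that they yield the four numbers quoted in the paragraph above; they are not the actual entries of Table 2. `edge_distance` counts tree edges between two labels, which is how the paragraph characterizes the UniFrac-style measure, and `td` uses an assumed TD form.

```python
from fractions import Fraction

def shared_ranks(a, b):
    """Number of leading ranks two labels have in common."""
    count = 0
    for rank_a, rank_b in zip(a, b):
        if rank_a != rank_b:
            break
        count += 1
    return count

def edge_distance(a, b):
    """Edges on the tree path between two labels (pair-based view)."""
    return len(a) + len(b) - 2 * shared_ranks(a, b)

def td(a, b):
    """Assumed TD: unshared ranks relative to the longer label."""
    longest = max(len(a), len(b))
    return Fraction(longest - shared_ranks(a, b), longest)

# Hypothetical labels: t5 is a sibling of t4; t6 extends t5 by two lower ranks.
t4 = ("r1", "x")
t5 = ("r1", "y")
t6 = ("r1", "y", "y1", "y2")

print(edge_distance(t4, t5), edge_distance(t4, t6))  # 2 4
print(td(t4, t5), td(t4, t6))                        # 1/2 3/4
```

Adding two over-specialized ranks doubles the edge distance, while TD grows from 1/2 to 3/4: the penalty for the extra ranks is bounded because the disagreement at the higher rank already accounts for most of the distance.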
Set-based HC metrics include hierarchical precision, recall and F-measure, as presented by Kosmopoulos et al. However, hierarchical recall cannot reflect over-specialization cases and hierarchical precision cannot reflect under-specialization cases. The F-measure combines precision and recall but is less intuitive than TD, which centers on the concept of rank error.
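The blind spots of hierarchical precision and recall can be seen in a small sketch. It assumes the usual set-based definitions, in which each label is expanded to the set containing itself and its ancestors; for path-style labels this set is simply the label's prefixes, so the overlap equals the number of shared leading ranks. The labels are hypothetical and `td` uses an assumed TD form.

```python
from fractions import Fraction

def shared_ranks(actual, predicted):
    count = 0
    for a, p in zip(actual, predicted):
        if a != p:
            break
        count += 1
    return count

def hierarchical_prf(actual, predicted):
    """Hierarchical precision, recall and F-measure for path-style labels."""
    overlap = shared_ranks(actual, predicted)
    precision = Fraction(overlap, len(predicted))
    recall = Fraction(overlap, len(actual))
    if overlap == 0:
        return precision, recall, Fraction(0)
    return precision, recall, 2 * precision * recall / (precision + recall)

def td(actual, predicted):
    """Assumed TD: unshared ranks relative to the longer label."""
    longest = max(len(actual), len(predicted))
    return Fraction(longest - shared_ranks(actual, predicted), longest)

family = ("orderA", "familyB")
genus = ("orderA", "familyB", "genusC")

# Over-specialization: the truth stops at family, the prediction adds a genus.
over = hierarchical_prf(family, genus)
# Under-specialization: the truth names a genus, the prediction stops at family.
under = hierarchical_prf(genus, family)

print(over, td(family, genus))   # recall stays at 1
print(under, td(genus, family))  # precision stays at 1
```

Recall is a perfect 1 in the over-specialized case and precision is a perfect 1 in the under-specialized case, whereas TD reports the same one-rank error (1/3) for both.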
However, TD also made the evaluation results highly dependent on taxonomy choice. The Taxonomy Distances might differ when using different databases. Some analysis platforms, such as Mothur, also made their own adjustments to taxonomic ranks. This also influences the calculation for Taxonomy Distance.
There are three things to note when using Taxonomy Distance. First, we assume that the dissimilarity between taxa is proportional to their rank difference. Second, Taxonomy Distance is influenced by the number of ranks in the two taxa being compared. Third, ATD behaves more like a recall rate than a precision rate, because the average is calculated over the true classes.
In addition to addressing concerns about taxonomy, for further studies we plan to evaluate performances of other taxonomic assignment methods, other biomarkers and other data sets. We also expect to make further biological interpretations based on those results.
More robust: Taxon count based metrics give equal weight to each taxon and focus on recognition capabilities; they are therefore less prone to imbalanced databases.
More informative: Taxonomy Distance adopts the concept of taxonomic hierarchy and differentiates incorrect predictions.
More comparable: Taxon count based metrics solve the controversial problem of pruning large taxa and Taxonomy Distance clears the problem of whether to exclude singletons before or after testing.
The sequence count based metrics with binary error measurement used by previous studies imply two assumptions: that taxa abundances in samples follow the same distribution as in the database, and that all different taxa are mutually equally different. These assumptions make performance evaluation and comparison results biased and less informative. This study proposes ATD and ATD_by_Taxa, together with the ATD plot, to avoid these problems.
The authors are grateful to Kun-Nan Tsai, Yu-Hsuan Ho, Hsin-Min Lu and Galit Shmueli for helpful discussions.
CYC developed all needed computer programs for this study, conducted the experiments and drafted the manuscript. SCC revised the manuscript substantially. SLT and SCC supervised this study. All authors contributed to the formulation of the problem, presentation of the results, and structure of the paper. All authors have read and approved the final manuscript.
This work was partially supported by the Ministry of Science and Technology of Taiwan under Grant 105–2410-H-002-101-MY3.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 1. Soueidan H, Nikolski M. Machine learning for metagenomics: methods and tools. Quantit Biol. 2016. arXiv:1510.06621v2 [q-bio.GN]. https://doi.org/10.1515/metgen-2016-0001.
- 3. Edgar RC. SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences. bioRxiv 074161; https://doi.org/10.1101/074161.
- 13. Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Dröge J, Gregor I, Majda S, Fiedler J, Dahms E, Bremges A, Fritz A, Garrido-Oter R, Jørgensen TS, Shapiro N, Blood PD, Gurevich A, Bai Y, Turaev D, DeMaere MZ, Chikhi R, Nagarajan N, Quince C, Meyer F, Balvočiūtė M, Hansen LH, Sørensen SJ, Chia BKH, Denis B, Froula JL, Wang Z, Egan R, Kang DD, Cook JJ, Deltel C, Beckstette M, Lemaitre C, Peterlongo P, Rizk G, Lavenier D, Yu-Wei W, Singer SW, Jain C, Strous M, Klingenberg H, Meinicke P, Barton MD, Lingner T, Lin H-H, Liao Y-C, Silva GGZ, Cuevas DA, Edwards RA, Saha S, Piro VC, Renard BY, Pop M, Klenk H-P, Göker M, Kyrpides NC, Woyke T, Vorholt JA, Schulze-Lefert P, Rubin EM, Darling AE, Rattei T, McHardy AC. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat Methods. 2017;14:1063–71.
- 17. Iram S, Jumeily DA, Fergus P, Hussain A. Exploring the hidden challenges associated with the evaluation of multi-class datasets using multiple classifiers, vol. 2014. Birmingham: Eighth International Conference on Complex, Intelligent and Software Intensive Systems; 2014. p. 346–52.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.