# CAT: computer aided triage improving upon the Bayes risk through *ε*-refusal triage rules

## Abstract

### Background

Manual extraction of information from electronic pathology (epath) reports to populate the Surveillance, Epidemiology, and End Results (SEER) database is labor intensive. Automating the data extraction using machine learning (ML) and natural language processing (NLP) is desirable to reduce the human labor required to populate the SEER database and to improve the timeliness of the data. This enables scaling up registry efficiency and collection of new data elements. To ensure the integrity, quality, and continuity of the SEER data, the misclassification error of ML and NLP algorithms needs to be negligible. Current algorithms fail to achieve the precision of human experts, who can bring additional information to their assessments. Differences in registry format and the desire to develop a common information extraction platform further complicate the ML/NLP tasks. The purpose of our study is to develop triage rules that partially automate the registry workflow and improve the precision of the auto-extracted information.

### Results

This paper presents a mathematical framework to improve the precision of a classifier beyond that of the Bayes classifier by selectively classifying the items that are most likely to be classified correctly. This results in a triage rule that only classifies a subset of the items. We characterize the optimal triage rule and demonstrate its usefulness in the problem of classifying cancer site from electronic pathology reports to achieve a desired precision.

### Conclusions

From the mathematical formalism, we propose a heuristic estimate of the triage rule based on post-processing the soft-max output from standard machine learning algorithms. We show, in test cases, that the triage rule significantly improves the classification accuracy.

## Keywords

Machine learning, Classification

## Introduction

The Surveillance, Epidemiology, and End Results (SEER) program collects and curates cancer patient data from electronic pathology (e-path) reports. These data are used for population-based research on cancer trends to develop cancer treatment and screening policy recommendations. The information is extracted manually by experts, which is labor intensive. This limits the geographical coverage of the database: currently, about 30% of all cancer cases in the United States are included in the SEER database.

Automating the data extraction using Machine Learning (ML) and Natural Language Processing (NLP) tools is desirable to reduce the human labor that is required to populate the SEER database, thereby increasing the geographical coverage of the database and improving the timeliness of the data. To ensure the integrity, quality and general usefulness of that database, the classification algorithm needs to have a misclassification error that is no more than that of human experts. This is challenging because these experts supplement the pathology reports with extra information not always available to the machine learning tools. Thus, it is implausible that, even under the best circumstances, the ML/NLP algorithms can achieve the same small error rate as human experts. Differences in registry format and the desire to develop a common information extraction platform further complicate the ML/NLP tasks.

The ML/NLP task of extracting information from e-path reports can be cast as a multi-class classification problem. The misclassification error of any machine learning classification algorithm is bounded from below by the Bayes risk [1]. In some instances, that lower bound may exceed the error rate that the end user can tolerate, in which case fully automatic classification is not useful.

The classification depends on both the content and the context of the pathology report. This leads to heterogeneity in the degree of difficulty of classifying electronic pathology reports, with some pathology reports being easy to classify while others are harder. If one can identify which pathology reports are easy or hard to classify, i.e., have small or large expected misclassification error, one may improve upon the Bayes error by automatically classifying only the reports that have small expected misclassification error, leaving to the experts the task of classifying the more challenging ones.

We call *triage* the machine learning algorithms that selectively classify items. We show in this paper that optimal triage classification rules achieve a misclassification error that is lower than that of the Bayes classifier, at the cost of not classifying a fraction of the items. The lower misclassification rate arises because we do not get penalized for refusing to classify, since these reports will be evaluated by experts. By strategically refusing to classify a large fraction of the items, we have the opportunity to achieve an arbitrarily small misclassification rate. But there are resource constraints that limit the fraction of reports we can refuse to classify. For example, ignoring the issue of building an appropriate infrastructure, if we wanted the SEER database to cover 100% of the population of the USA with the same manpower as now, we could refuse to evaluate at most 30% of the reports.

In this paper, we present the mathematical foundation for Computer Aided Triage (CAT) by showing how it expands on existing concepts of statistical machine learning. The paper is structured as follows: we first formalize the notation and define mathematically the *ε*-triage rule. We then prove that the classification error of the optimal *ε*-triage rule decreases monotonically as the refusal fraction *ε* increases, characterize the optimal “Bayes” *ε*-triage rule, and relate it to the classical Bayes rule. In the next section, we apply heuristics derived from these results to post-process deep learning classification of cancerous tumor sites from electronic pathology reports to achieve a desired level of confidence. In a fourth section, we provide another application of this methodology to risk management. Finally, the interested reader will find the proofs in the appendix.

## Mathematical formulation

### Preliminaries

Let \(\{(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\}\) be *n* independent identically distributed random vectors with a common joint distribution.

The covariates \(X_{i} \in {\mathbb X} \subset {\mathbb R}^{p}\) while the response variable \(Y_{i} \in {\mathcal A} = \{ a_{1},\ldots,a_{m} \}\) is one of *m* labels.

A classifier is a function \(\widehat Y\) from the feature space \({\mathbb X}\) into the set of labels \(\{a_{1},\ldots,a_{m}\}\). To any classifier \(\widehat Y(x)\), we can associate a partition \(A_{1},\ldots,A_{m}\) of the feature space \({\mathbb X}\), defined by \(A_{k} = \left \{ x \in {\mathbb X} : \widehat Y(x) = a_{k} \right \}\). The partition of the Bayes classifier \(Y^{\star}(X)\), which minimizes the misclassification error \({\mathbb P}\left [\widehat Y(X) \not = Y\right ]\) among all measurable functions \(\widehat Y(X)\), is given by

\[A_{k}^{\star} = \left\{ x \in {\mathbb X} : p(a_{k}|x) \geq \max_{j} p(a_{j}|x) \right\}, \quad k = 1,\ldots,m,\]

where \(p(a_{k}|x) = {\mathbb P}[Y = a_{k} | X = x]\) denotes the conditional class probability, with the convention that if there exist two or more indices for which we have equality, *x* is assigned to the set with the lowest index to ensure that the sets \(A_{k}^{\star }\) partition the feature space \({\mathbb X}\). See [1] for example.
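As a minimal sketch of the Bayes partition on a finite feature space, the Python snippet below assigns each feature value to the class with the largest conditional probability, breaking ties toward the lowest index, and computes the resulting Bayes risk. The conditional probabilities and the marginal weights are hypothetical numbers chosen for illustration, not values from the study.

```python
def bayes_label(posterior):
    """Index k of the Bayes class given [p(a_1|x), ..., p(a_m|x)]."""
    best = 0
    for k, p in enumerate(posterior):
        if p > posterior[best]:  # strict ">" keeps the lowest index on ties
            best = k
    return best


# Hypothetical conditional probabilities for three feature values, m = 3 labels.
posteriors = [
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],  # tie between the first two labels
    [0.1, 0.3, 0.6],
]
weights = [0.5, 0.3, 0.2]  # hypothetical P[X = x] for each feature value

labels = [bayes_label(p) for p in posteriors]
# Bayes risk: probability that even the most probable label is wrong.
bayes_risk = sum(w * (1.0 - max(p)) for w, p in zip(weights, posteriors))

print(labels)                # [0, 0, 2]
print(round(bayes_risk, 4))  # 0.41
```

In this toy setting no classifier that labels every item can have an error below 0.41, which is the kind of floor that motivates refusing to classify some items.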

### Optimal triage rule

While the Bayes error is a lower bound for the misclassification error of any classifier, the optimal triage rule achieves a lower misclassification error because it is allowed to refuse to classify some items. Let us formally define the *ε*-*triage* rule.

### **Definition 1**

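An *ε*-triage is a function *T* from the feature space \({\mathbb X}\) into the extended set of labels \(\{a_{1},\ldots,a_{m}\} \cup \{\emptyset\}\), where the label *∅* represents the “no classification” category, and

\[B_{k} = \left\{ x \in {\mathbb X} : T(x) = a_{k} \right\}, \quad k = 1,\ldots,m.\]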

Unlike the sets *A*_{1},…,*A*_{m} defined for a classifier, the sets *B*_{1},…,*B*_{m} do not form a partition of \({\mathbb X}\). As a result, let us define the decision set \(D = \cup _{k=1}^{m} B_{k}\), and the rejection set *D*^{c}={*x*:*T*(*x*)=*∅*}.

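A triage rule incurs a loss only when it misclassifies an item in the decision set *D*. That is, a refusal to classify does not get penalized. Formally, the loss function for a triage rule *T* is

\[L(T) = {\mathbb P}[Y \not = T(X), X \in D]. \tag{2}\]

The loss *L*(*T*) can be made arbitrarily small by making *D* small enough. To avoid this uninteresting answer, we constrain the size of the decision set *D* to satisfy

\[{\mathbb P}[X \in D] \geq 1-\varepsilon.\]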

Our first theorem characterizes the minimizer of (2), thus providing an analogous result to Bayes rule in the classical classification context.
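A short worked identity, added here as a reading aid rather than as part of the original derivation, shows why restricting the decision set can only help. It writes \(p(a_{k}|x)\) for the conditional probability \({\mathbb P}[Y = a_{k}|X = x]\) and uses the indicator \({\mathbf 1}\{\cdot\}\), which is notation assumed for this sketch. Conditioning on *X* gives

\[{\mathbb P}[Y \not = T(X), X \in D] = {\mathbb E}\left[ {\mathbf 1}\{X \in D\} \left(1 - p(T(X)|X)\right) \right] \geq {\mathbb E}\left[ {\mathbf 1}\{X \in D\} \left(1 - \max_{j} p(a_{j}|X)\right) \right],\]

with equality when *T* assigns the most probable label on *D*. Taking \(D = {\mathbb X}\) gives the Bayes risk \({\mathbb E}[1 - \max_{j} p(a_{j}|X)]\), and because the integrand is nonnegative, removing points from *D* cannot increase the right-hand side. The error rate among the items that are actually classified is the ratio \({\mathbb P}[Y \not = T(X), X \in D] / {\mathbb P}[X \in D]\).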

### **Theorem 1**

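The optimal triage rule \(T^{\star}\) that minimizes \( {\mathbb P}[Y \not = T(X), X \in D]\) subject to \({\mathbb P}[X \in D] \geq 1-\varepsilon \), for a given 0<*ε*<1, is characterized by the sets

\[B_{k}^{\star} = \left\{ x \in {\mathbb X} : p(a_{k}|x) \geq \max_{j} p(a_{j}|x) \ \text{and} \ p(a_{k}|x) > b \right\}, \quad k = 1,\ldots,m,\]

where *b* is the largest value such that

\[{\mathbb P}\left[X \in D^{\star}(b)\right] \geq 1-\varepsilon, \qquad D^{\star}(b) = \cup_{k=1}^{m} B_{k}^{\star} = \left\{ x \in {\mathbb X} : \max_{j} p(a_{j}|x) > b \right\}.\]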

**Remark** Observe that the sets *D*^{⋆}(*b*)⊂*D*^{⋆}(*b*^{′}) when *b*≥*b*^{′}, or equivalently, when *ε*≥*ε*^{′}: a larger threshold corresponds to a larger refusal fraction. As a result, the loss of the optimal triage rule *L*(*T*^{⋆}) is a monotone decreasing function of *ε*.
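The thresholding described above can be sketched on a finite feature space as follows. This is an illustration under stated assumptions, not the authors' implementation: the conditional probabilities and weights are hypothetical, the function name `triage` is ours, and the code keeps items whose top probability is at least a cut value `cut`, which matches the rule "largest conditional probability exceeds *b*" for any *b* just below `cut`.

```python
def triage(posteriors, weights, eps):
    """Threshold the largest conditional probability on a finite feature space.

    posteriors[i] = [p(a_1|x_i), ..., p(a_m|x_i)], weights[i] = P[X = x_i].
    Returns (cut, decisions, coverage, loss): decisions[i] is a class index
    or None (refusal), coverage = P[X in D], loss = P[Y != T(X), X in D].
    """
    conf = [max(p) for p in posteriors]
    # Visit feature values from most to least confident and stop as soon as
    # the decision set carries probability at least 1 - eps.
    order = sorted(range(len(conf)), key=lambda i: -conf[i])
    mass, cut = 0.0, 1.0
    for i in order:
        cut = conf[i]
        mass += weights[i]
        if mass >= 1.0 - eps - 1e-12:  # small tolerance for rounding error
            break
    decisions, coverage, loss = [], 0.0, 0.0
    for p, w, c in zip(posteriors, weights, conf):
        if c >= cut:
            decisions.append(p.index(c))  # lowest index attaining the maximum
            coverage += w
            loss += w * (1.0 - c)
        else:
            decisions.append(None)
    return cut, decisions, coverage, loss


# Hypothetical two-class problem with four feature values.
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
weights = [0.4, 0.2, 0.3, 0.1]
results = {eps: triage(posteriors, weights, eps) for eps in (0.0, 0.1, 0.3)}
for eps, (cut, decisions, coverage, loss) in results.items():
    print(eps, cut, decisions, round(coverage, 3), round(loss, 3))
```

With these numbers the loss falls from 0.225 (the Bayes risk, at *ε*=0) to 0.18 at *ε*=0.1 and to 0.10 at *ε*=0.3, which illustrates the monotonicity stated in the remark.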

It is insightful to specialize Theorem 1 to characterize the optimal triage rule for binary labeled features.

### **Corollary 1**

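Suppose that *Y*∈{0,1} takes on only two values. Then the indecision region for the optimal triage rule is

\[\left(D^{\star}\right)^{c} = \left\{ x \in {\mathbb X} : 1-b \leq {\mathbb P}[Y = 1 | X = x] \leq b \right\},\]

where *b* is the largest value such that \({\mathbb P}[X \in D] \geq 1- \varepsilon \).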

A similar “indecision set”, based on the likelihood ratio, arises in sequential learning [2].
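For binary labels, thresholding the largest conditional probability at *b* amounts to refusing whenever the probability of class 1 lies between 1−*b* and *b*. The sketch below illustrates this under the assumption that the conditional probability of class 1 is known; the function name `binary_triage` and the input values are hypothetical.

```python
def binary_triage(p1, b):
    """Triage one item from p1 = P[Y = 1 | X = x], with 1/2 <= b < 1."""
    if not 0.5 <= b < 1.0:
        raise ValueError("b must lie in [1/2, 1)")
    if p1 > b:
        return 1
    if p1 < 1.0 - b:  # equivalently P[Y = 0 | X = x] > b
        return 0
    return None  # indecision region: 1 - b <= p1 <= b


# Hypothetical conditional probabilities of class 1 for five items.
decisions = [binary_triage(p, b=0.8) for p in (0.95, 0.7, 0.5, 0.3, 0.1)]
print(decisions)  # [1, None, None, None, 0]
```

Raising *b* widens the indecision region symmetrically around 1/2, so fewer items are classified but those that are classified carry a larger winning probability.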

### Relationship to Bayes rule

It is instructive to relate the optimal triage rule to the Bayes rule. To this end, denote by *Y*^{⋆} the Bayes rule.

### **Proposition 1**

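We can express the optimal triage rule \(T^{\star}\) in terms of the Bayes classifier \(Y^{\star}\). To this end, define the function

\[\delta_{b}(x) = \begin{cases} 1 & \text{if } \max_{j} p(a_{j}|x) > b, \\ \emptyset & \text{otherwise,} \end{cases}\]

and adopt the conventions \(a_{k} \times 1 = a_{k}\) and \(a_{k} \times \emptyset = \emptyset\). The optimal triage rule can then be written as

\[T^{\star}(x) = Y^{\star}(x) \times \delta_{b}(x).\]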

This allows us to reinterpret Theorem 1: the optimal triage rule is the Bayes rule on the restricted set where the largest conditional probability satisfies \(\max_{j} p(a_{j}|x) > b\). That is, the optimal triage is the Bayes rule provided that the conditional probability of the winning class is large enough. As a consequence, it is possible that a triage rule never assigns an item to a particular class *a*_{k}, namely when the conditional probability *p*(*a*_{k}|*x*)<*b* for all *x*. Identifying such classes is instructive, as it reveals which classes are difficult to classify.
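The following sketch illustrates this reinterpretation: the triage label is the Bayes label gated by a threshold on the winning probability, and classes that are never assigned can be listed directly. The function names and the conditional probabilities are hypothetical, chosen only to show a class whose probability never exceeds the threshold.

```python
def triage_label(posterior, b):
    """Bayes label if the winning probability exceeds b, otherwise None."""
    top = max(posterior)
    return posterior.index(top) if top > b else None


def never_classified(posteriors, b):
    """Classes that the thresholded rule assigns to none of the given items."""
    m = len(posteriors[0])
    assigned = {triage_label(p, b) for p in posteriors}
    return [k for k in range(m) if k not in assigned]


# Hypothetical conditional probabilities; the third class never exceeds 0.5.
posteriors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.30, 0.25, 0.45],
    [0.20, 0.30, 0.50],
]
print([triage_label(p, 0.6) for p in posteriors])  # [0, 1, None, None]
print(never_classified(posteriors, 0.6))           # [2]
```

Lowering the threshold to 0.4 in this toy example lets the third class be assigned again, so the list of never-classified classes depends on the refusal fraction one is willing to accept.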

## Heuristic: using soft-max to build a triage rule

The previous section describes how the optimal triage rule is a thresholding function of the (optimal) Bayes rule. In this section, we propose a heuristic triage estimator obtained by post-processing the soft-max produced by various machine learning algorithms. Looking at Eq. (7), we propose to use a classifier that produces a soft-max to define sets that mimic the sets *D*^{⋆} and \(B_{k}^{\star }\). If the soft-max output from an ML algorithm is a good estimate of some monotone increasing transformation of the actual conditional probabilities, then this heuristic will produce good triage rules. Other estimation strategies are possible, and will be explored in a future manuscript.

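Let \({\mathcal H}\) be a class of functions \(h(a,x)\) of the class labels \(a \in {\mathcal A}\) and features \(x \in {\mathbb X}\). For any function \(h \in {\mathcal H}\), the soft-max is defined by

\[s_{h}(a_{k}|x) = \frac{\exp\{h(a_{k},x)\}}{\sum_{j=1}^{m} \exp\{h(a_{j},x)\}}, \quad k = 1,\ldots,m.\]

The soft-max need not be a good estimate of the conditional probability \(p(a_{k}|x)\), even for consistent classifiers. However, the soft-max of a consistent classifier is such that the sets

\[\widehat A_{k} = \left\{ x \in {\mathbb X} : s_{h}(a_{k}|x) \geq \max_{j} s_{h}(a_{j}|x) \right\}\]

asymptotically recover the Bayes partition \(A_{k}^{\star}\). We seek a threshold \(b^{\prime}\) such that the set

\[\widehat D(b^{\prime}) = \left\{ x \in {\mathbb X} : \max_{j} s_{h}(a_{j}|x) > b^{\prime} \right\}\]

asymptotically equals *D*^{⋆} as the sample size tends to infinity. To this end, consider the following definition:

### **Definition 2**

A soft-max \(s_{h}\) is *M*-consistent (or monotone consistent) for the conditional class probability \(p(a_{k}|x)\) if

\[s_{h}(a_{k}|x) \longrightarrow \phi\left(p(a_{k}|x)\right), \quad k = 1,\ldots,m,\]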

where *ϕ* is a monotone increasing function and the convergence is in probability.

Under that assumption, it is possible to construct, via post-processing of the soft-max, a consistent triage rule from an *M*-consistent soft-max estimator.
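One way such post-processing could look is sketched below. It is a minimal illustration under stated assumptions, not the authors' procedure: the scores are hypothetical, the helper names `softmax`, `fit_cut` and `triage` are ours, and the cut is chosen as an empirical quantile of the top soft-max value on held-out items so that at least a fraction 1−*ε* of them is classified. A quantile-based cut depends only on the ranking of the top soft-max values, so it is unchanged when those values are replaced by a strictly increasing transformation of them, which is why it pairs naturally with the *M*-consistency assumption.

```python
import math


def softmax(scores):
    """Soft-max of one score vector [h(a_1, x), ..., h(a_m, x)]."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_cut(confidences, eps):
    """Largest cut that still classifies at least a fraction 1 - eps."""
    ranked = sorted(confidences, reverse=True)
    keep = max(1, math.ceil((1.0 - eps) * len(ranked) - 1e-9))
    return ranked[keep - 1]


def triage(scores, cut):
    """Winning class index, or None when the top soft-max is below the cut."""
    probs = softmax(scores)
    top = max(probs)
    return probs.index(top) if top >= cut else None


# Hypothetical held-out scores for five items and three classes.
held_out = [
    [4.0, 0.0, 0.0],
    [1.0, 0.5, 0.0],
    [0.0, 3.0, 0.0],
    [0.2, 0.1, 0.0],
    [0.0, 0.0, 2.0],
]
cut = fit_cut([max(softmax(s)) for s in held_out], eps=0.4)
decisions = [triage(s, cut) for s in held_out]
print(decisions)  # [0, None, 1, None, 2]
```

The code keeps items whose top soft-max is at least `cut`, which corresponds to a strict threshold at any value just below `cut`. In practice the cut would be fitted on data not used for training, since soft-max values on training data tend to be overconfident.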

## Example

### A first example

As an illustrative example, we combine deep learning and natural language processing algorithms to classify the primary cancer site, as described by the ICD-O-3 classification manual [3], of 22571 electronic pathology reports from the Louisiana SEER catchment area. The fitted algorithm returns a soft-max value for 139 distinct primary cancer sites. With minimal optimization, the false positive rate for the classifier is 24.64%, which is substantially higher than the advertised classification error of less than 5% for manual classification. This presents us with an opportunity to demonstrate the practical usefulness of triage.

… \(\max_{j} p(a_{j}|x)\), which is consistent with the *M*-consistency assumption.
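To make the idea of reaching a prescribed error rate concrete, the sketch below picks the confidence cut that retains the most items while keeping the empirical error among retained items at or below a target. It is an illustration only: the confidences and correctness flags are synthetic, the function name `cut_for_target_error` is ours, and nothing here reproduces the results on the Louisiana reports.

```python
def cut_for_target_error(confidences, correct, target):
    """Return (cut, kept_fraction, error) for the largest confidence-ranked
    group of items whose empirical error is at most `target`, or None if no
    such group exists."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    best = None
    errors = 0
    for k, (conf, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        # Only cut between distinct confidence values so the rule stays a
        # threshold on the confidence.
        end_of_tie = k == len(ranked) or ranked[k][0] < conf
        if end_of_tie and errors / k <= target:
            best = (conf, k / len(ranked), errors / k)
    return best


# Synthetic validation data: top soft-max value and whether the label was right.
confidences = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40]
correct = [True, True, True, True, False, True, True, False, False, True]

cut, kept, err = cut_for_target_error(confidences, correct, target=0.2)
print(cut, kept, round(err, 3))  # 0.7 0.7 0.143
```

In this synthetic example the classifier errs on 30% of all items, while classifying only the 70% of items with confidence at least 0.7 brings the error among classified items down to one in seven. The cut is selected and evaluated on the same items here, so the reported error is optimistic; a separate evaluation set would be needed for an honest estimate.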

### Ad hoc improvement strategy

## Conclusion

This paper shows that one can improve a classifier's performance by selectively classifying items, and that the optimal triage rule can be expressed in terms of the Bayes rule. Such triage rules are useful when one seeks machine learning algorithms that achieve a prescribed precision, as is the case in automatic annotation of electronic pathology reports.

## Proofs

### Proof of theorem 1

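The loss of a triage rule *T* is

\[L(T) = {\mathbb P}[Y \not = T(X), X \in D] = \sum_{k=1}^{m} {\mathbb P}[Y \not = a_{k}, X \in B_{k}].\]

For a given decision set *D*, the misclassification error is minimized by taking

\[B_{k} = \left\{ x \in D : p(a_{k}|x) \geq \max_{j} p(a_{j}|x) \right\}, \quad k = 1,\ldots,m. \tag{13}\]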

To ensure that *B*_{k}∩*B*_{j}=*∅* for *k*≠*j*, we apply the convention that if there exist two or more indices for which we have equality, *x* is assigned to the set with the lowest index.

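To identify the set *D* that minimizes the misclassification error subject to the constraint that \({\mathbb P}[X \in D] \geq 1- \varepsilon \), consider all triage rules that satisfy (13). For these rules, the loss is

\[L(T) = {\mathbb E}\left[ \left(1 - \max_{j} p(a_{j}|X)\right) {\mathbf 1}\{X \in D\} \right],\]

so that, by an argument similar to the Neyman-Pearson lemma [4], the optimal set *D* is of the form

\[D = \left\{ x \in {\mathbb X} : \max_{j} p(a_{j}|x) > b \right\},\]

where *b* is the largest value for which the constraint \({\mathbb P}[ X \in D] \geq 1-\varepsilon \) is satisfied. The conclusion follows by expressing the ratio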

## Notes

### Acknowledgements

The authors would like to thank Lynne Penberthy from NIH for bringing the SEER classification problem to our attention.

### Funding

NH and TB were supported by the National Nuclear Security Administration’s Advanced Simulation and Computation program. LC and NH were also supported by CRADA with Ernst and Young. This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Los Alamos National Laboratory under Contract DE-AC52-06NA25396 and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. Publication costs were funded by the JDACS4C program.

### Availability of data and materials

The raw epath reports were accessed as part of a Data Use Agreement.

### About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 18, 2018: Selected Articles from the Computational Approaches for Cancer at SC17 workshop. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-18.

### Disclaimer

This report was prepared as an account of work sponsored by an agency of the U.S. Government. Neither Los Alamos National Security, LLC, the U.S. Government nor any agency thereof, nor any of their employees make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by Los Alamos National Security, LLC, the U.S. Government, or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of Los Alamos National Security, LLC, the U.S. Government, or any agency thereof. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

### Authors’ contributions

NH and TB conceived of this project; NH, LC, GT, JQ, BC, and TB carried out the research; GT, X-CW, and TB made the data available. NH wrote the paper with TB, and all authors agreed to the text. All authors read and approved the final manuscript.

### Ethics approval and consent to participate

The data was obtained from the Louisiana registry under data use agreements, and the research was approved by the institutional review boards.

### Consent for publication

Not applicable.

### Competing interests

The authors declare no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

- 1. Devroye L, Gyorfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition. New York: Springer; 1996.
- 2. Wald A. Sequential Analysis. New York: Wiley; 1947.
- 3. World Health Organization. International Classification of Diseases for Oncology, Third edition, first revision. Geneva: WHO Press; 2013.
- 4. Neyman J, Pearson ES. On the Problem of the Most Efficient Tests of Statistical Hypotheses. In: Philosophical Transactions of the Royal Society of London. Series A, Vol. 231: 1933. p. 289–337.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.