Background

The natural history of endometrial hyperplasia is not fully understood [1]. What is known is that a proportion of simple and complex hyperplastic processes will regress without treatment [2], although the time scale over which such regression may occur is unclear. Similarly, the time scale over which benign endometrium progresses to hyperplasia is unknown. Among studies evaluating the accuracy of tests for the diagnosis of hyperplasia (miniature biopsy or ultrasonography), it has previously been hypothesised that if histological verification of the diagnosis is delayed after performing the test, the estimation of test accuracy may be influenced by disease regression or progression [3]. For instance, false positive diagnoses of endometrial hyperplasia may occur due to natural disease regression during the time interval between testing and verification of the diagnosis. Similarly, false negative diagnoses may result from progression of benign functional or atrophic endometrium to hyperplasia during the same interval.

To obtain accurate estimates of test accuracy in studies of hyperplasia, an immediate comparison of the test under scrutiny with a reference standard that verifies the diagnosis is essential [4–6]. When accuracy studies suffer from a delay in performance of the reference standard, the resultant false positives and false negatives will be expected to lead to an underestimation of test accuracy. In systematic reviews, when studies of various designs are collated, the extent of underestimation that arises from delay is important in obtaining an unbiased pooled accuracy estimate. To our knowledge, the extent of underestimation of accuracy due to a delay in verification of diagnosis has not been evaluated empirically in studies of endometrial hyperplasia. We therefore undertook this analysis to examine formally the extent to which accuracy estimates are biased in studies evaluating miniature endometrial biopsy devices and endometrial thickness measurement by pelvic ultrasonography for predicting endometrial hyperplasia when there are delays in histological verification of the diagnosis.

Methods

To test our hypothesis, a data set of all published studies reporting the accuracy of miniature endometrial biopsy devices and endometrial ultrasonography for predicting endometrial hyperplasia was obtained from two systematic reviews [7, 8]. The reviews focused on test accuracy studies in which the results of the test were compared with the results of a reference standard. The target population was women with abnormal pre- or postmenopausal uterine bleeding. The diagnostic tests of interest were miniature endometrial biopsy devices (for example, the Pipelle® endometrial suction curette, Unimar, Wilton, CT, USA) and endometrial thickness measurement by pelvic ultrasonography. The reference standard was endometrial histology obtained by an independent endometrial sampling technique, for example, inpatient curettage (with hysteroscopy) or hysterectomy.

Identification of studies

Two independent electronic searches of MEDLINE and EMBASE were conducted to identify relevant citations on endometrial biopsy (1980–1999) and ultrasonography (1966–2000). The search term combination for endometrial biopsy [8] was diagnosis (MeSH) AND endometrial biopsy (textword), while that for studies of ultrasonography [7] was ultrasound AND endometrial thickness AND sonography (textwords). The searches were limited to human studies, but there were no language restrictions. Relevant studies were identified by examining all retrieved citations, the reference lists of all known reviews and primary studies, and through direct contact with manufacturers. Details of the search and selection processes can be found in the published reports of the reviews [7, 8].

Study quality assessment

All selected studies were assessed for their methodological quality, defined as the confidence that the study design, conduct and analysis minimise bias in the estimation of diagnostic accuracy [9–11]. We considered the following features in quality assessment: method of recruitment of the sample, appropriateness of the patient spectrum, and blinding of the comparison between test and reference standard. Recruitment was considered adequate if patient selection was consecutive or a random sample was obtained. The patient spectrum was considered appropriate if both pre- and postmenopausal women were included. Blinding was considered present if it was clearly reported that the pathologists providing histological reports were kept unaware of the results of miniature endometrial biopsy or endometrial ultrasonography. Blinding was categorised as absent if the results of the diagnostic tests were divulged to the pathologists, or if no such reporting was provided. For the purpose of our analysis, studies were classified into two quality categories: category I studies had at least one of the following features: adequate recruitment, appropriate spectrum, or blinding; category II studies had none of these quality features.

Data extraction

In addition to assessment of methodological quality, data were extracted to allow classification of studies into one of two groups: i) immediate verification – reference standard performed within 24 hours of testing, and ii) delayed verification – reference standard performed more than 24 hours after testing. Any studies that could not be categorised in this way due to lack of reporting were excluded. Data were then abstracted as 2 × 2 tables and estimates of diagnostic accuracy were derived for each individual study. A correction factor of 0.5 was added to each cell when a 2 × 2 table included zero values [12]. True positive rates (sensitivity), false positive rates (1 − specificity) and diagnostic odds ratios (dORs) were calculated for each primary evaluation. The dOR is the ratio of the positive to the negative likelihood ratio and can be summarised mathematically as:

dOR = [sensitivity/(1-specificity)] / [(1-sensitivity)/specificity]
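
To illustrate the calculation, a minimal sketch (ours, not code from the included studies or reviews) is given below; the 0.5 continuity correction follows the description above, and the example counts are hypothetical.

```python
# Minimal sketch: diagnostic accuracy measures from a 2 x 2 table (hypothetical counts).
def diagnostic_accuracy(tp, fp, fn, tn):
    """Return sensitivity, specificity and the diagnostic odds ratio (dOR)."""
    # Apply the 0.5 continuity correction if any cell is zero.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # 1 - false positive rate
    dor = (sensitivity / (1 - specificity)) / ((1 - sensitivity) / specificity)
    return sensitivity, specificity, dor  # dor also equals (tp * tn) / (fp * fn)

# Illustrative example:
sens, spec, dor = diagnostic_accuracy(tp=18, fp=3, fn=2, tn=77)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, dOR={dor:.1f}")
```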

Statistical analysis

Pooled dORs were generated as the principal measure of diagnostic accuracy. Meta-analyses to produce summary estimates of accuracy were performed separately for the subgroups of studies with immediate and delayed verification. To delineate the impact of delay in verification of diagnosis, we used meta-regression analysis [13, 14] with the log of the dOR as the accuracy measure. This technique fits a multivariable linear regression model to examine the influence of delay, study quality and test type on the accuracy estimates of the studies included in the analysis (random effects model). In this way the analysis was adjusted for the confounding effects of study quality (two quality categories) and type of test (miniature endometrial biopsy or endometrial ultrasound).
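
As an illustration of this modelling approach (a simplified sketch, not the authors' actual analysis), the fragment below fits an inverse-variance weighted regression of log dOR on indicators for delay, quality category and test type; a full random-effects model would additionally estimate a between-study variance (tau squared). The study data and column names are hypothetical.

```python
# Simplified meta-regression sketch on log dOR (hypothetical study data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

studies = pd.DataFrame({
    "tp": [18, 9, 25, 7, 12, 20], "fp": [3, 5, 2, 8, 4, 6],
    "fn": [2, 4, 1, 6, 3, 5],     "tn": [77, 60, 90, 40, 55, 70],
    "delayed":    [0, 1, 0, 1, 1, 0],  # 1 = verification delayed > 24 hours
    "quality_II": [0, 1, 1, 0, 1, 0],  # 1 = quality category II
    "ultrasound": [0, 0, 1, 1, 0, 1],  # 1 = ultrasound, 0 = miniature biopsy
})

# 0.5 continuity correction for any table containing a zero cell
cells = studies[["tp", "fp", "fn", "tn"]].astype(float)
zero_rows = cells.min(axis=1) == 0
cells.loc[zero_rows] = cells.loc[zero_rows] + 0.5

# log dOR and its approximate variance (sum of reciprocal cell counts)
log_dor = np.log((cells["tp"] * cells["tn"]) / (cells["fp"] * cells["fn"]))
var_log_dor = (1 / cells).sum(axis=1)

# Inverse-variance weighted regression (fixed-effect approximation);
# a random-effects model would add an estimated tau^2 to each study variance.
X = sm.add_constant(studies[["delayed", "quality_II", "ultrasound"]])
fit = sm.WLS(log_dor, X, weights=1 / var_log_dor).fit()
print(fit.summary())
# exp(coefficient on "delayed") is the adjusted relative dOR for delayed verification;
# 1 - exp(coefficient) can be read as the percentage underestimation of accuracy.
```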

Results

Selection of studies

The study selection process is shown in Figure 1. In total there were 2,982 subjects in 27 diagnostic evaluations reported in the 24 eligible primary studies. Eleven evaluations delayed verification of the diagnosis by more than 24 hours; the delay was up to six months in one study, up to four weeks in four studies, up to three weeks in one study and up to one week in the remaining three studies. Three of these studies were rated as category I for methodological quality, and eight as category II. Sixteen evaluations verified the diagnosis within 24 hours of the test. Among these, seven studies were rated as category I for quality, and nine as category II (Table 1).

Figure 1 Flow diagram showing study selection process.

Table 1 Study characteristics and methodological quality.

Table 2 shows the diagnostic accuracy results for individual studies according to test type and verification status (immediate or delayed). The summary statistics showed that the pooled dOR for studies with immediate verification was 67.2 (21.7–208.8), while that for studies with delayed verification was 16.2 (8.6–30.5), as shown in Figure 2. Meta-regression analysis for bias due to delay in verification of diagnosis, adjusted for study quality and test type, showed that test accuracy was underestimated by 74% on average (95% CI 7%–99%; P = 0.048) among studies with delayed verification compared with studies with immediate verification (Table 3).
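
As a rough, unadjusted check (our illustration, not an analysis reported here), the ratio of the two pooled dORs quoted above points to a similar degree of underestimation as the adjusted meta-regression estimate:

```python
# Unadjusted check using the pooled dORs quoted above.
ratio = 16.2 / 67.2                          # delayed vs immediate pooled dOR
print(f"relative dOR = {ratio:.2f}")         # about 0.24
print(f"underestimation = {1 - ratio:.0%}")  # about 76%, close to the adjusted 74%
```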

Figure 2 Effect of delayed verification on the diagnostic accuracy of miniature endometrial biopsy and transvaginal ultrasound in detecting endometrial hyperplasia. Pooled diagnostic odds ratios (dOR) for studies with immediate and delayed verification.

Table 2 Accuracy stratified by the time delay between test performance and histological confirmation by the chosen reference standard.
Table 3 Results of meta-regression analysis.

Discussion

Our study shows empirically the magnitude of the bias associated with delay in the verification of diagnosis in test accuracy studies. A delay in verification of more than 24 hours was associated with a considerable underestimation of the accuracy of miniature biopsy and endometrial ultrasonography in diagnosing endometrial hyperplasia. This supports the premise that the reported limited accuracy of miniature endometrial biopsy devices and endometrial ultrasonography in diagnosing hyperplasia is due, in part, to the natural history of the disease rather than entirely to intrinsic limitations in the performance of the diagnostic tools [3].

We posed our hypothesis a priori and tested it as rigorously as possible. Our literature search had no language restriction, facilitating retrieval of many relevant test accuracy studies. However, because of poor reporting, many critical pieces of information were missing from the available literature, restricting the number of studies that could be included in our analysis (for example, 31 studies were ineligible for inclusion because explicit information about the time to verification was omitted). Our examination of delays in verification was also restricted: only two time categories were discernible (delay < 24 hours or > 24 hours). Immediate verification (reference standard performed straight after the index test) was not achievable in some studies because the reference test (inpatient endometrial sampling) necessitated the use of general anaesthesia. A practical cut-off of 24 hours was taken to allow time for reference testing to be undertaken when the preceding index tests (miniature endometrial biopsy and ultrasound) were performed in conscious outpatients. Although the natural history of endometrial hyperplasia is unclear, it is unlikely that biological alteration would have occurred within 24 hours. Studying the rate of disease progression or regression would require repeated testing over time, but such a study is unlikely to be ethically justifiable, given that most clinicians will institute treatment following the initial diagnosis. Such a study would then become one of prognosis under treatment rather than of natural history.

We also evaluated other features of methodological quality and, in general, found the quality of studies to be poor. For example, only three studies reported that interpretation of the reference test was blinded to the results of the index test. A lack of blinding can introduce bias and overestimation of diagnostic accuracy [4]. Pathological interpretation of endometrial hyperplasia is open to a varying degree of subjectivity, especially at the extreme ends of the spectrum, where overlap with benign functional endometrium (simple hyperplasia) and cancer (complex hyperplasia with cytological atypia) is more likely. Absence of blinding (or the lack of explicit reporting of it) is thus associated with poorer methodological quality, and this feature was incorporated in our quality assessment. Our analysis adjusted for the confounding effects of quality, but our inferences should be interpreted with caution because of the relative scarcity of good-quality studies.

Conclusions

Our findings have implications for research into new diagnostic interventions. Our results demonstrate that evaluations with a robust study design (immediate verification) showed good test performance, whereas evaluations with poorer designs (delayed verification) showed poor performance. Poor designs may reflect the situation prevalent in routine clinical practice, where test results may not be immediately confirmed because of resource and other constraints. Thus, diagnostic evaluations carried out in routine practice may underestimate the true accuracy of tests.