Background

Depression is a prevalent and disabling condition in both high- and low-income countries [1–3]. According to the World Health Organization, depression is the 4th most disabling medical disorder, and is predicted to be the 2nd most disabling medical condition by 2020 [1, 4]. The 12-month prevalence of depression has been reported as 4.1%, with a lifetime prevalence of 6.7% [5].

Treatment guidelines developed in high income countries (HIC) recommend routine screening for depression in primary health care (PHC) as an initial step in holistic patient care [6–8]. A number of brief (≤12 items) instruments, including the Patient Health Questionnaire (PHQ-9) [9, 10] and the Kessler-10 (K-10) [11], have been validated in low and middle income countries (LMIC). Similarly, longer (≥15 items) instruments, including the Centre for Epidemiological Studies Depression scale (CES-D) [12], have also been validated in LMIC.

The bulk of research summarizing findings about the accuracy of validated depression screening instruments has come from HIC, providing conflicting data [13–15]. For example, one review found marginal differences between brief and ultra-brief scales [14], while a meta-analysis by Mitchell et al. (2007) reported that brief and ultra-brief scales were equally accurate [15].

Generalizing findings from studies conducted in HIC to LMIC may be inappropriate due to a number of differences. Low literacy rates, cultural diversity and high patient numbers are some of the factors that characterize LMIC [3, 16, 17]. Differences such as low literacy rates may influence the accuracy of depression screening instruments, making the generalization of findings from HIC to LMIC more difficult.

Depression is a major health problem across LMIC; however, a number of countries in sub-Saharan Africa are equally plagued with a high burden of HIV/AIDS. Indeed, close to two thirds of all persons living with HIV/AIDS (PLWHA) reside in sub-Saharan Africa [18]. Research has also shown that up to 30% of PLWHA may develop a depressive disorder during the course of their illness [19, 20].

Screening for depression among PLWHA is important for a number of reasons, one of which is the symptom overlap between the two disorders. For example, suicide, fatigue, sadness and insomnia are symptoms reported by both PLWHA and those with depression. The existence of this symptom overlap calls for screening PLWHA who present at PHC for depression. Indeed, a number of researchers have recommended the routine screening of depression in PLWHA [21–24]. However, literature about the validity of screening instruments in the setting of HIV/AIDS remains scanty [25].

The aim of our systematic review was to examine the accuracy of depression screening instruments which have been validated in LMIC, comparing brief and long scales. We also compared the accuracy of instruments validated in general and HIV-PHC settings.

These findings could guide clinicians about which scales to adapt for routine use in busy PHC settings within LMIC.

Methods

A literature search was conducted using the following approach:

We searched the PubMed, Cochrane Library, AIDSLINE, and PsycINFO databases for studies published in English from inception up to July 2011. In our search, we used the following key words: sensitivity/specificity, validation, depression/depressive disorders, and screening instruments/tools/scales. These key words were combined with LMIC, HIV/AIDS, Africa, Asia, Eastern Europe, and South America. We then searched the reference lists of retrieved articles for suitable papers and consulted two sets of authors [26, 27] for clarification of data in their papers.

Study selection

Studies were included if they had the following outcomes of interest:

  1. A depression screening instrument followed by a formal diagnostic instrument or an interview was administered to all screened patients, i.e. both screen positives and screen negatives.

    The diagnosis of a depressive disorder (major/minor/dysthymia) was based on the ICD-10 [28], DSM-IV [29], or an instrument frequently used as a gold standard. Instruments routinely used to screen for depression including the [30, 31] were not considered gold standard, even though a number of studies had used them [25, 32].

  2. Studies were conducted in non-mental health facilities.

  3. Studies reported the sensitivity, specificity, area under the receiver operating characteristic curve (AUC) and predictive values of the screening instrument in comparison to the diagnostic standard.

  4. Studies were conducted in LMIC as defined by the World Bank [33].
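For reference, the accuracy measures named in the third criterion can be derived from a diagnostic 2x2 table. The following is a minimal sketch only; the function name and the counts are hypothetical and are not taken from any included study.

```python
def accuracy_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return standard accuracy measures from the four cells of a 2x2 table.

    tp/fn: gold-standard cases that screened positive/negative.
    fp/tn: gold-standard non-cases that screened positive/negative.
    """
    return {
        "sensitivity": tp / (tp + fn),  # proportion of cases detected
        "specificity": tn / (tn + fp),  # proportion of non-cases ruled out
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts, for illustration only.
m = accuracy_measures(tp=40, fp=15, fn=10, tn=135)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, the predictive values depend on the prevalence of depression in the screened sample, which is one reason they are hard to compare across settings.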

Data analysis

Data from included studies was extracted by one author (DA) into tables constructed in MS Excel, and later transferred to RevMan version 5.1.2 [34]. We used RevMan to construct a diagnostic 2x2 table for each study by calculating the true positive, false positive, false negative and true negative figures from the sensitivity, specificity and prevalence values reported in the included studies. The figures from the 2x2 tables were then fitted in STATA version 11.2 [35] to assess for heterogeneity using a random effects model. The heterogeneity assessment guided our decision on whether it was possible to pool, analyze, and report the findings as a meta-analysis. We used the meta-analytic commands in STATA for the analysis.
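The back-calculation of a 2x2 table from reported summary values can be sketched as follows. This is plain arithmetic under stated assumptions, not a reproduction of RevMan's internal procedure: it assumes the sample size is known, rounds to whole participants, and uses hypothetical input values and names.

```python
from typing import NamedTuple


class TwoByTwo(NamedTuple):
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def reconstruct_2x2(n: int, prevalence: float,
                    sensitivity: float, specificity: float) -> TwoByTwo:
    """Approximate the 2x2 cell counts from reported summary statistics."""
    cases = round(n * prevalence)        # participants with depression
    non_cases = n - cases                # participants without depression
    tp = round(cases * sensitivity)      # cases that screened positive
    fn = cases - tp                      # cases that screened negative
    tn = round(non_cases * specificity)  # non-cases that screened negative
    fp = non_cases - tn                  # non-cases that screened positive
    return TwoByTwo(tp, fp, fn, tn)


# Hypothetical study: 200 participants, 25% prevalence,
# sensitivity 0.80, specificity 0.90.
table = reconstruct_2x2(n=200, prevalence=0.25,
                        sensitivity=0.80, specificity=0.90)
print(table)
```

Because of rounding, counts reconstructed this way may differ by one or two participants from the original study data.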

Study quality assessment and inclusion

Data was independently abstracted by three authors (DA, EO and TA). DA read all the abstracts; 1151 studies were excluded based on abstracts alone. Full texts of 65 articles were retrieved for further scrutiny. Of these, 14 studies, in which 19 instruments were validated with 3759 participants, met our criteria. See Figure 1.

Figure 1 Study selection process for the systematic review.

Study inclusion and exclusion was independently done by DA, EO and TA; in the event of ambiguity, DJS was the arbitrator. We used RevMan to assess study quality. The parameters assessed included blinding of reference information from screening results, screening of patients from highly selected populations, and selection of who gets the gold standard from among a screened population. Study quality was rated as fair, acceptable or good. All included studies were then scrutinized independently by JJ.

Results

Of the 19 included studies, 10 fulfilled all the reporting criteria by RevMan [30] and were considered of good quality [26, 36–42]. One study was considered fair in quality due to the lack of blinding and referral of only screen positives for the diagnosis from a highly selected population [11]. The rest of the studies (n=8) were considered acceptable. The studies of acceptable quality had limited information about blinding, and some lacked clarity about the time interval between administration of the screening instrument and the gold standard [27, 43–47].

General description of studies

Eleven studies were conducted in Africa [11, 26, 27, 38, 40–43, 47], five of which were in HIV settings [26, 27, 38, 41, 43]. Two studies were conducted in South America [36, 37] and six in Asia [39, 44–46]. The most frequently used diagnostic instrument was the Mini International Neuropsychiatric Interview (MINI) [48]. Table 1 shows the general characteristics of the studies. The sample sizes of included studies ranged from 61 to 649. The prevalence of depression varied widely across populations, ranging from 11.1 to 53.5% (see Table 2). There were also wide variations within continents, and according to the different instruments used. All validated instruments were able to adequately identify depression, with AUC ranging from 0.69 to 0.99. Table 2 shows the variables that were used to assess for heterogeneity.

  a) BDI-SF, 1 instrument

    Leticia et al. (2005) [36] validated the BDI-SF among 155 patients admitted to general medical wards in Brazil. The gold standard was based on the ICD-10 [28].

  b) K-6, 1 instrument

    Tesfaye et al. (2009) validated the K-6 in 100 postnatal women attending a general PHC clinic in Ethiopia. A psychiatric interview based on the DSM-IV [29] was used as the gold standard.

  c) K-10, 4 instruments

    The K-10 was validated at four PHC sites, one of which was an HIV PHC site. Fernandes et al. (2011) [45] validated the K-10 among 194 pregnant mothers at a rural prenatal clinic in India. Spies et al. (2009) [27] validated the K-10 in 429 HIV-infected adults in an HIV care centre in South Africa, using the MINI as the gold standard. Baggaley et al. (2007) [11] validated a translated version of the K-10 in Burkina Faso among 61 women; the gold standard was a detailed diagnostic interview by a psychiatrist within 3 days of administering the K-10. Tesfaye et al. (2009) validated the K-10 in 100 postnatal women attending a general PHC clinic in Ethiopia. A psychiatric interview based on the DSM-IV [29] was used as the gold standard.

  d) PHQ-9, 1 instrument

    The English language version of the PHQ-9 was translated into Thai by Lotrakul et al. (2008) [39], then back translated and adapted for use in Thailand. The PHQ-9 was then validated among 280 participants in a general PHC setting in Thailand.

  e) EPDS, 5 instruments

    The EPDS was the most frequently validated instrument, in both prenatal and postnatal women. However, it should be noted that women accessing antenatal and postnatal care predominantly seek help for pregnancy related complaints, and may differ from persons attending general PHC. Despite such differences in the reason for seeking help at PHC, studies report a 10-20% prevalence of depression in postnatal women [49–51]. This high prevalence supports screening for depression in this population. We also report on these studies because the findings could be of interest to persons involved in women’s mental health research.

    Fernandes et al. (2011) [45] validated the EPDS among 194 women in their third trimester of pregnancy at a rural prenatal clinic in Karnataka, India. The gold standard against which the EPDS was validated was the ICD-10. In mainland China, Lau et al. (2010) [44] validated the Chinese version of the EPDS in 342 postnatal women, using the Structured Clinical Interview for DSM-III-R (SCID) [52] as the gold standard.

    Chibanda et al. (2010) [43] validated the Shona version of the EPDS among 210 postpartum HIV-infected and uninfected women attending two primary care clinics in peri-urban Harare, Zimbabwe. In Brazil, Figueira et al. (2009) [37] validated the EPDS in a sub-sample of 245 mothers; the MINI was used as the gold standard.

    Tesfaye et al. (2009) validated the EPDS in 100 postnatal women attending a general PHC clinic in Ethiopia. A psychiatric interview based on the DSM-IV [29] was used as the gold standard.

  f) Other brief instruments, 3 instruments

    Puertas et al. (2004) [46] validated a visual analogue scale (VAS) and the GHQ-10 among 450 participants in India using the revised Clinical Interview Schedule (CIS-R) [53] as the gold standard. The CIS-R is based on the ICD-10 [28].

    In Uganda, Muhwezi et al. (2007) [47] assessed the validity of a 4-item subjective well-being subscale (SWB) in detecting a major depressive illness. A total of 199 consecutive patients were enrolled at a PHC facility and interviewed using the SWB, with the MINI [48] as the gold standard.

    Table 1 General description of the studies included in the systematic review
    Table 2 Parameters used to assess for heterogeneity of included studies

Longer scales

  a) CES-D, 2 instruments

    In Zambia, Chisanga et al. (2011) [38] conducted a cross-sectional study in 16 primary level care clinics and validated the CES-D in PLWHA who had tuberculosis and were starting ART. The CES-D was validated against the MINI [48] as the gold standard.

    Myer et al. (2008) [26] validated the CES-D among 465 individuals enrolled into HIV care in South Africa. The MINI was used as the gold standard.

  b) SRQ-20, 1 instrument

    In Malawi, Stewart et al. (2009) [40] validated the Chichewa version of the Self Reporting Questionnaire (SRQ) among 114 subjects at a PHC site. This instrument went through a process of forward and back translation.

  c) Other long instruments

    Kaaya et al. (2002) [41] validated the Hopkins Symptom Checklist-25 (HSCL-25) among 99 women who were pregnant and HIV positive in Tanzania. The gold standard was the SCID [52].

Analysis for the presence of heterogeneity between studies

We used the ‘meta’ commands of STATA to generate the forest plots and assess for heterogeneity. The test for heterogeneity using a random effects model yielded a statistically significant result (heterogeneity chi-squared = 189.23 on 18 degrees of freedom, p < 0.001).

Statistically significant heterogeneity meant we could not continue with the meta-analysis and report the results as pooled estimates.
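To put the reported chi-squared statistic in context, it can be converted to the I² statistic, which expresses the share of total variation attributable to between-study heterogeneity rather than chance. I² is not reported in this review; the sketch below is a supplementary illustration that assumes the reported heterogeneity chi-squared is Cochran's Q, and the function name is hypothetical.

```python
def i_squared(q: float, df: int) -> float:
    """Return I² as a percentage from Cochran's Q and its degrees of freedom.

    I² = max(0, (Q - df) / Q) * 100
    """
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Values reported above: chi-squared = 189.23 on 18 degrees of freedom.
i2 = i_squared(189.23, 18)
print(f"I-squared = {i2:.1f}%")
```

Under this assumption the result is roughly 90%, a level conventionally read as considerable heterogeneity, which is consistent with the decision not to report pooled estimates.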

Discussion

We present the first systematic review comparing the accuracies of brief and long depression screening instruments which have been validated in LMIC settings. In this review, we found evidence to show that within LMIC, a number of depressed patients are identified using screening instruments at PHC settings. The prevalence figures reported in the included studies also vary widely across PHC settings within LMIC.

We found statistically significant heterogeneity between studies and therefore could not complete a meta-analysis. The heterogeneity across studies could be the result of methodological differences in the validation of instruments. For example, we found that a single instrument could be validated using different reference standards, producing different cut-off scores and AUC values. The CES-D and EPDS were such examples in our review [26, 38, 43, 45]. In addition, these studies were conducted across continents and settings with different cultures, languages and resources.

Both brief and longer scales showed moderate to high accuracy, with AUC ranging from 0.69 to 0.99. Our review found evidence that brief scales including the PHQ-9, BDI-SF, K-6, K-10, EPDS, and GHQ-12 were as accurate as longer ones like the CES-D, HSCL, and BDI. These findings are in agreement with previous reviews which assessed the accuracy of depression screening instruments in HIC [6, 14]. For example, a review of instruments validated in the Spanish language reported overall sensitivity and specificity in the range of 70-90% [13]. Instruments with AUC values of 0.50 to 0.70 are generally considered of low accuracy, those with values of 0.70 to 0.90 of moderate accuracy, and those with AUC ≥ 0.90 highly accurate [54, 55]. Of the instruments studied, the EPDS showed acceptable accuracy in detecting depression among prenatal and postnatal women, which was in agreement with a previous systematic review [50]. Among HIV clinic populations, the HSCL-25 [41] showed the highest sensitivity at 89%.

No single instrument was superior to another in our review, perhaps due to the relatively small number of studies with any particular instrument. Previous reviews that have assessed the diagnostic accuracy of depression instruments were equally unable to recommend a single instrument for use in PHC [15, 50].
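The AUC bands cited above can be expressed as a simple lookup. This is a sketch only: the function name is hypothetical, and because the cited ranges share the boundary value 0.70, the code assumes that an AUC of exactly 0.70 falls in the moderate band.

```python
def auc_band(auc: float) -> str:
    """Classify an AUC value using the bands cited in the text [54, 55]."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc >= 0.90:
        return "high"
    if auc >= 0.70:   # assumption: the shared boundary 0.70 counts as moderate
        return "moderate"
    if auc >= 0.50:
        return "low"
    return "below chance"  # worse than a random classifier


# The lowest and highest AUC values reported across the included instruments.
for value in (0.69, 0.99):
    print(value, auc_band(value))
```

On this reading, the lowest reported AUC of 0.69 sits just under the moderate band, while the highest, 0.99, is well within the high band.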

Limitations

A number of limitations should be acknowledged. For example, we did not include studies that were not published in English. That said, our literature review did not return any studies in other languages that appeared to meet our inclusion criteria. While some studies published in non-indexed journals may have escaped notice, there has been an increase in indexed journals in LMIC in recent years, and most studies of quality should therefore have been captured.

Secondly, we did not include instruments which had been used to screen for the whole range of psychiatric morbidity, limiting our scope to those that had been validated for depression only. Including scales which had screened for both depression and anxiety disorders could have been more informative; however, such criteria could have turned up numerous studies which may have been difficult to synthesize. Although the K-10, GHQ and SRQ-20 instruments assess for common mental disorders including anxiety, depression and psychological distress, we only included them if they had been used to screen for depression.

Conclusion

Brief instruments are as accurate as longer ones in detecting depression in both general and HIV-PHC settings. The brevity of instruments such as the BDI-SF, PHQ-9, and K-10 gives them the edge over longer scales like the CES-D, because they can be administered in a short time. However, because ultra-brief scales such as the K-6 and BDI-SF do not encompass the whole range of depressive symptoms, including suicide, their use needs to be followed up with detailed psychiatric diagnostic interviews. The K-6 was shown to be as accurate as the K-10 in the study by Tesfaye et al. (2009).

Other scales such as the EPDS may be the instrument of choice in particular populations (e.g. postnatal mothers).