1 Introduction

The term “disease burden” refers to the financial, medical, or socio-economic impact of a disease or health problem [1]. Researchers in public health frequently measure the burden of various diseases or health problems across different geographic locations or at different time points, for purposes such as assessing population health, evaluating the effectiveness of interventions, formulating health policies, and planning future resource allocation.

There is no consensus on the best measure of disease burden; the choice often depends on individual values or specific needs. One common measure is financial cost, which summarizes the direct and indirect costs of illness and can be nontrivial for low-income populations. For example, Paez et al. examined out-of-pocket expenses, the economic burden borne by patients and their families, for more than 100 chronic conditions in both adults and children. They found that annual out-of-pocket expenses in the US increased by 39.4% from 1996 to 2005 after adjusting for inflation [2].

Another measure of disease burden is mortality rate. It counts the number of deaths due to a specific medical condition in a particular population, scaled to the size of that population, in unit time. In one study on the correlation between diabetes and ischemic heart disease (IHD), Laing et al. found that young adult women with diabetes were more than 8 times more likely to die of IHD than those without diabetes were. Similar trends were observed among young adult men, older men, and older women; and patients with Type I diabetes were found to have a relatively higher IHD mortality rate than patients with Type II diabetes [3].

By contrast, morbidity rate describes the frequency with which a disease occurs in a population and is often expressed as an incidence rate or a prevalence rate. Incidence rate refers to the proportion of newly diagnosed cases of a disease in a population, while prevalence rate accounts for both newly diagnosed and pre-existing cases of a disease. Corbett et al. found that worldwide, among 0.9 million newly diagnosed adult cases of tuberculosis (TB) in 2000, 9% were attributable to HIV. In selected African countries and the United States, 31% and 26% of TB cases were attributable to HIV, respectively. TB, in turn, led to about 11% of adult deaths from AIDS [4]. This study demonstrated the comorbidity of TB and HIV, and highlighted the need for a targeted intervention strategy in countries with a high prevalence of HIV and TB.

A more sophisticated measure of disease burden is Disability Adjusted Life Years (DALYs), defined as the sum of the Years Lived with Disability (YLDs) and the Years of Life Lost (YLLs) owing to a disease or health problem. Both YLDs and YLLs are age-weighted to reflect productivity and societal investment (e.g., years lived as a young adult are valued more than years spent as a young child or older adult). DALYs are the primary measure of disease burden developed for the Global Burden of Disease Study [5], the most comprehensive worldwide observational epidemiological study to date, in which researchers have been estimating DALYs among populations of different ages, sexes, and countries for more than 200 diseases and causes of death since 1990. Several developed countries, including the Netherlands [6] and Australia [7], use DALYs to survey and compare their nationwide burden of diseases for public policymaking.

Each of these established measures of disease burden has its own limitations. The financial cost of a disease, for instance, does not reflect health-related quality of life or untreated cases [8]. Mortality rate does not capture the disease burden prior to death [9], and in practice, it is often difficult to determine the actual cause of death, as death is often the consequence of multiple diseases or injuries [10]. Morbidity does not adjust for the severity and impact of diseases. DALYs require a large amount of time and resources to calculate. As a result, many pandemic and rare diseases have been left unstudied, and it is nearly impossible to compare the disease burden across a large number of diseases over time [11, 12]. Further, the estimates of disease burden from different studies sometimes conflict with each other. For example, the prevalence of Parkinson’s disease in Spain was reported to be 1.5%, 0.6%, and 0.2% in 1994 by three separate groups [13,14,15].

In recent years, new data sources from the Internet have proven useful in a range of fields. For instance, Ginsberg et al. used search query keywords describing influenza-like illness on Google to predict influenza epidemics, as the query volumes were highly correlated with the actual influenza prevalence data reported by the Centers for Disease Control and Prevention [16]. Similarly, Moat et al. identified correlations between the stock prices of 30 Dow Jones Industrial Average component companies and weekly Wikipedia page view data, and were able to increase their portfolio return by 65% using Wikipedia page view data instead of conventional strategies to build prediction models [17]. We therefore investigated whether mining these new data sources, primarily Google Trends and Wikipedia page view data, would allow the estimation of disease burden for a large number of diseases in an automated and cost-efficient way. Specifically, we examined the alignment of disease burden, in terms of disease prevalence and financial cost, with Google Trends and Wikipedia data for 1,633 diseases over 11 years. We also applied the least absolute shrinkage and selection operator (LASSO), a regression method that performs variable selection and regularization, to predict the burden of four specific diseases, using the Internet data along with other variables that we quantified in a previous study [18].

2 Data and Methods

2.1 Data Collection

Google Trends and Wikipedia are two publicly available data sources that record searching and browsing activities related to various diseases and health conditions on the Internet. On Google Trends (https://www.google.com/trends/), users enter one to five keywords to retrieve their relative search volume. The upper-right panel of Fig. 1 shows the output of querying “breast cancer,” “obesity,” “acne,” “headache,” and “anemia” in the interactive user interface. The x-axis gives the timeline and the y-axis gives the normalized search volume as a percentage, where the denominator is the highest search volume among all queried terms in the given time frame (e.g., in Fig. 1, the highest search volume is for breast cancer around October 2012). Google Trends also allows users to specify the geographic location, time period, data source category (e.g., Arts & Entertainment, Books & Literature, Health), and type of search (e.g., web search, image search, news search), or to use its Application Programming Interface (API) for batch queries. In our experiment, these parameters were set to worldwide, 2004 (the earliest year available on Google Trends) to 2014, all categories, and web search, respectively.

Because each query is normalized by its own maximum, we developed a two-step strategy to retrieve comparable relative search volumes for the 1,633 diseases from 2004 to 2014. As shown in the left panel of Fig. 1, the first step was to find the disease with the highest search volume during the defined time frame among all our diseases of interest and set it as the baseline disease. We then divided the diseases into 5-disease groups, with the baseline disease inserted into each group, and queried Google Trends again (right panel of Fig. 1). Hence, the normalization denominator was the same for every group.

Wikipedia provides a simpler API that allows us to download the weekly page view counts of each disease term from 2008 to 2014. We computed the annual Wikipedia page view counts by summing the weekly counts over all 52 weeks of each year.
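The two-step retrieval can be illustrated with a minimal Python sketch. It is not the authors' code: `find_baseline`, `query_with_baseline`, and `make_fake_fetch` are hypothetical names, the "tournament" used to locate the baseline is only one plausible way to carry out the first step, and the stand-in `fetch` function merely imitates a service that rescales every query so that its largest value is 100. All numbers are made up.

```python
from typing import Callable, Dict, List, Sequence

Series = Dict[str, List[float]]
# Hypothetical query function: takes up to five terms and returns each term's
# series scaled so that the largest value in the whole query equals 100.
Fetch = Callable[[Sequence[str]], Series]


def find_baseline(terms: Sequence[str], fetch: Fetch, group_size: int = 5) -> str:
    """Step 1 (assumed procedure): carry the current leader into each new query."""
    champion = terms[0]
    step = group_size - 1
    for start in range(1, len(terms), step):
        result = fetch([champion, *terms[start:start + step]])
        # The term that reaches the top of the scale has the highest peak so far.
        champion = max(result, key=lambda term: max(result[term]))
    return champion


def query_with_baseline(terms: Sequence[str], baseline: str, fetch: Fetch,
                        group_size: int = 5) -> Series:
    """Step 2: query every group together with the baseline term."""
    others = [t for t in terms if t != baseline]
    step = group_size - 1
    volumes: Series = {}
    for start in range(0, len(others), step):
        volumes.update(fetch([baseline, *others[start:start + step]]))
    return volumes


def make_fake_fetch(true_series: Series) -> Fetch:
    """Stand-in for the real service, built from made-up 'true' volumes."""
    def fetch(terms: Sequence[str]) -> Series:
        peak = max(max(true_series[t]) for t in terms)
        return {t: [100.0 * v / peak for v in true_series[t]] for t in terms}
    return fetch


# Made-up numbers, for illustration only.
true_series = {
    "acne": [10.0, 12.0],
    "obesity": [30.0, 20.0],
    "headache": [25.0, 28.0],
    "anemia": [8.0, 9.0],
    "breast cancer": [40.0, 80.0],
    "asthma": [15.0, 14.0],
    "gout": [5.0, 6.0],
}
fetch = make_fake_fetch(true_series)
baseline = find_baseline(list(true_series), fetch)
volumes = query_with_baseline(list(true_series), baseline, fetch)
```

Because the baseline holds the highest peak overall, every second-step query is divided by the same number, so values from different groups can be compared directly. The real service reports rounded values, which this sketch ignores.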

Fig. 1. Strategy to retrieve relative search volume from Google Trends for 1,633 diseases.

2.2 Disease Nomenclature

When diseases or medical conditions are mentioned in different online contexts, they can be abbreviated, exhibit various morphological or orthographical variations, or have multiple synonyms. For example, medical professionals also refer to stroke as cerebrovascular accident, cerebrovascular insult, or brain attack. To ensure the completeness and consistency of the query results, we used the Metathesaurus of the Unified Medical Language System (UMLS) [19], which unifies more than one million medical concepts and five million names from more than a hundred biomedical controlled vocabularies and terminologies. For each of the 1,633 diseases of interest, we queried Google Trends and Wikipedia using all of its synonyms and defined its search volume as the highest search volume among all its synonyms.
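As a small illustration of the synonym rule, the Python sketch below takes the maximum across synonyms separately for each time period. Whether the maximum was taken per period or over the whole time frame is not stated above, so the per-period reading is an assumption; `disease_volume` is a hypothetical helper and the numbers are made up.

```python
from typing import Dict, List


def disease_volume(synonym_volumes: Dict[str, List[float]]) -> List[float]:
    """Per-period maximum across all synonyms of one disease (assumed rule)."""
    if not synonym_volumes:
        raise ValueError("at least one synonym is required")
    return [max(values) for values in zip(*synonym_volumes.values())]


# Made-up yearly volumes for three of the synonyms of stroke.
stroke = {
    "stroke": [60.0, 62.0, 58.0],
    "cerebrovascular accident": [4.0, 3.0, 5.0],
    "brain attack": [1.0, 1.0, 2.0],
}
stroke_volume = disease_volume(stroke)
```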

2.3 Benchmark Data on Disease Burdens

We obtained the benchmark data on disease burdens for 2004 to 2010 from our previous study [18], and the data for 2011 to 2014 from MarketScan®, a large medical claims database offered by Truven Health. Using the UMLS, we set the disease terminology for our analysis to PheWAS codes, as they represent clinically meaningful phenotypes with appropriate granularity [20]. More specifically, for the 1,633 diseases defined by PheWAS codes from medical claims databases with non-zero Internet or disease burden data, we calculated the relative prevalence and the relative treatment cost, together with the relative number of publications and the relative number of clinical trials, using the method introduced in our previous study [18]. The “relative” treatment cost, for instance, is defined as a given disease’s treatment cost divided by the total treatment cost of all the 1,633 diseases. In this way, the different factors become unitless and comparable.
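The "relative" quantities are simple shares of a yearly total. A minimal Python sketch follows; the function name `to_relative` and the three-disease example with made-up costs are illustrative assumptions, not the study data.

```python
from typing import Dict


def to_relative(values: Dict[str, float]) -> Dict[str, float]:
    """Divide each disease's value by the total over all diseases in that year."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: value / total for name, value in values.items()}


# Made-up treatment costs (any currency unit) for three diseases in one year.
cost_2010 = {
    "diabetes mellitus": 600.0,
    "multiple sclerosis": 300.0,
    "viral hepatitis": 100.0,
}
relative_cost_2010 = to_relative(cost_2010)
```

The resulting shares sum to 1 within each year, which is what makes prevalence, cost, and search volume directly comparable despite their different units.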

2.4 Analytic Method

We denote the relative search volume of disease i in year j on Google as \( G_{i, j} \), where \( \sum_{i = 1}^{n} G_{i, j} = 1 \). Thus, the vector \( G_{, j} \), which can be written out as (\( G_{1, j} , G_{2, j} , G_{3, j} , \ldots , G_{n, j} \)), represents the relative search volume of all n diseases in year j on Google, and the vector \( G_{i, } \), which can be written out as (\( G_{i, 1} , G_{i, 2} , G_{i, 3} , \ldots , G_{i, m} \)), denotes the relative search volume of disease i in all m years of interest on Google. Similarly, we define the relative page view counts of disease i in year j on Wikipedia as \( W_{i, j} \), the relative prevalence of disease i in year j as \( P_{i, j} \), and the relative treatment cost of disease i in year j as \( C_{i, j} \).

To determine whether the information from Google Trends and Wikipedia can approximate the burden of diseases, we conducted three analyses. First, we examined the correlations between the Internet data and disease burdens, measured by relative prevalence and relative treatment cost, for all the diseases of interest as a whole. We did so by computing the Pearson correlation coefficients of (\( G_{, k} ,P_{, l} \)) and (\( G_{, k} ,\,C_{, l} \)) for years k and l from 2004 to 2014, and the Pearson correlation coefficients of (\( W_{, k} ,P_{, l} \)) and (\( W_{, k} ,\,C_{, l} \)) for years from 2008 to 2014. We also computed Spearman rank correlation coefficients and the corresponding p values to test the null hypothesis that the Internet data are not correlated with those disease burden measures.
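The year-by-year correlation table can be sketched in plain Python as follows. The helper names (`pearson`, `ranks`, `spearman`, `correlation_table`) and the four-disease toy vectors are assumptions for illustration; the sketch omits p values, which in practice would come from a statistics package.

```python
import math
from typing import Callable, Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_table(
    internet: Dict[int, List[float]],
    burden: Dict[int, List[float]],
    corr: Callable[[List[float], List[float]], float] = pearson,
) -> Dict[Tuple[int, int], float]:
    """Correlate the Internet vector of year k with the burden vector of year l."""
    return {(k, l): corr(internet[k], burden[l]) for k in internet for l in burden}


# Made-up relative values for four diseases in two years.
G = {2004: [0.40, 0.30, 0.20, 0.10], 2005: [0.35, 0.35, 0.20, 0.10]}
P = {2004: [0.50, 0.20, 0.20, 0.10], 2005: [0.45, 0.25, 0.20, 0.10]}
table = correlation_table(G, P)
```

Entries with k equal to l correspond to same-year comparisons, while entries with k greater than l compare searches with the burden of an earlier year.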

Second, we determined whether the Internet data could forecast the disease burden during the same year, one year later, and two years later at the individual disease level. Mathematically, for each disease i, we computed the Pearson correlation coefficients between the relative search volume on Google and relative disease prevalence, (\( G_{i, } , P_{i, } \)), (\( G_{i, } , \tilde{P}_{i, } \)), and (\( G_{i, } , \tilde{\tilde{P}}_{i, } \)); between the relative page views on Wikipedia and relative disease prevalence, (\( W_{i, } , P_{i, } \)), (\( W_{i, } , \tilde{P}_{i, } \)), and (\( W_{i, } , \tilde{\tilde{P}}_{i, } \)); between the relative search volume on Google and relative treatment cost, (\( G_{i, } , C_{i, } \)), (\( G_{i, } , \tilde{C}_{i, } \)), and (\( G_{i, } , \tilde{\tilde{C}}_{i, } \)); and between the relative page views on Wikipedia and relative treatment cost, (\( W_{i, } , C_{i, } \)), (\( W_{i, } , \tilde{C}_{i, } \)), and (\( W_{i, } , \tilde{\tilde{C}}_{i, } \)), where \( \tilde{P}_{i, } = (P_{i, 2} , P_{i, 3} , P_{i, 4} , \ldots , P_{i, m + 1} ) \), \( \tilde{\tilde{P}}_{i, } = (P_{i, 3} , P_{i, 4} , P_{i, 5} , \ldots , P_{i, m + 2} ) \), \( \tilde{C}_{i, } = (C_{i, 2} , C_{i, 3} , C_{i, 4} , \ldots , C_{i, m + 1} ) \), and \( \tilde{\tilde{C}}_{i, } = (C_{i, 3} , C_{i, 4} , C_{i, 5} , \ldots , C_{i, m + 2} ) \).
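The lagged comparisons for a single disease can be sketched in Python as below. The function `lagged_correlation` is a hypothetical helper, the series are made up, and restricting the calculation to the years in which both series are available is an assumption about how unequal lengths would be handled.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(internet: List[float], burden: List[float], lag: int) -> float:
    """Correlate internet[t] with burden[t + lag] over the overlapping years."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(internet), len(burden) - lag)
    if n < 3:
        raise ValueError("need at least three overlapping years")
    return pearson(internet[:n], burden[lag:lag + n])


# Made-up series for one disease: the burden follows the search volume
# with a one-year delay (burden[t + 1] = 2 * search[t]).
search = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
burden = [0.0, 6.0, 2.0, 8.0, 2.0, 10.0, 18.0]
same_year = lagged_correlation(search, burden, lag=0)
one_year_later = lagged_correlation(search, burden, lag=1)
```

In this toy case the one-year lag recovers a perfect correlation, whereas the same-year coefficient is weaker; with only about a decade of yearly data, such coefficients rest on very few points and should be read cautiously.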

Finally, we used a LASSO-based regression model to predict the relative disease burden \( (\tilde{P}_{i, } ,\tilde{C}_{i, } ) \) using \( G_{i, } \), \( W_{i, } \), and three other variables introduced in our previous work [18], namely the relative number of scientific articles from PubMed (\( L_{i, } \)), relative number of clinical trials (\( T_{i, } \)), and relative funding from the NIH (\( F_{i, } \)). Unlike traditional linear regression, LASSO performs variable (feature) selection and regularization [21], which helps when predictors are numerous relative to the number of observations. The diseases we chose are viral hepatitis, diabetes mellitus, other headache syndrome, and multiple sclerosis, whose relative burdens demonstrated the strongest correlations with relative search volume on Google and relative page views on Wikipedia in the second step. All these computations were performed in the R programming environment, in which LASSO is implemented by the “glmnet” package [22].
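For reference, the LASSO estimate for a single disease i can be written as the following penalized least-squares problem. This is the standard Gaussian objective minimized by glmnet when its elastic-net mixing parameter is set to 1; the year index j, the number of usable years N, and the predictor vector \( x_{i, j} \) are notation introduced here for illustration, and whether the predictors were standardized or extended with further variables is not stated above.

\[
\min_{\beta_0, \beta} \; \frac{1}{2N} \sum_{j = 1}^{N} \left( P_{i, j + 1} - \beta_0 - x_{i, j}^{\top} \beta \right)^2 + \lambda \sum_{k = 1}^{5} \lvert \beta_k \rvert ,
\qquad
x_{i, j} = \left( G_{i, j} , W_{i, j} , L_{i, j} , T_{i, j} , F_{i, j} \right)^{\top}
\]

The same form applies to \( \tilde{C}_{i, } \), with \( C_{i, j + 1} \) in place of \( P_{i, j + 1} \). Larger values of \( \lambda \) shrink more coefficients exactly to zero, which is how the method performs variable selection.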

3 Results

First, for all the diseases as a whole, we analyzed the correlations between disease burdens, measured by relative disease prevalence (\( P_{, l} \)) and relative treatment cost (\( C_{, l} \)), and the relative search volume on Google in different years. Table 1 lists the Pearson correlation coefficients. The coefficients in Table 1(a) are all greater than the corresponding values in Table 1(b), indicating that disease prevalence is more strongly correlated with search volume than treatment cost is. This can be explained by the definitions of the two measures. The treatment cost of a disease equals its prevalence times the average treatment fee per patient diagnosed with the disease in a given year. When all diseases are evaluated as a whole, the treatment cost estimate therefore varies more than disease prevalence does, which reduces its correlation with the relative search volume on Google.

Table 1. The correlations between relative search volume on Google (\( G_{, k} \)) and relative disease burdens (\( P_{, l} \) and \( C_{, l} \)) during 2004-2014

We were also interested in the relationship between the relative search volume on Google in a given year t and the relative prevalence of a disease in year t-1 (the highlighted area under the diagonal lines in Table 1), as we initially assumed that individuals search the Internet once they receive a diagnosis. However, we did not observe such a trend. This may be because not all patients with a given diagnosis search the Internet and not all people who search for a particular disease are diagnosed patients, or because the computation of prevalence includes both newly diagnosed and pre-existing cases. One observable trend is that the Pearson correlation coefficients on the diagonal and immediately below it increase slowly with time, despite a few downward instances during 2009 and 2011. This weak increase suggests that it is becoming increasingly common to search the Internet for health-related topics.

We also tested the null hypotheses that \( \mathrm{cor}(G_{, k} ,P_{, l} ) = 0 \) and \( \mathrm{cor}(G_{, k} ,C_{, l} ) = 0 \) and found that all the computed p values were less than 0.05. Therefore, we concluded that the relative search volume on Google and the relative disease prevalence (or the treatment cost) are unlikely to be uncorrelated. The correlations between the relative page views on Wikipedia and relative disease burdens showed similar trends from 2008 to 2014. The p values of the hypothesis testing (https://cci-hit.uncc.edu/resources/rqJan2017TableS1.xlsx) and the correlation analyses between the relative page views on Wikipedia and relative disease burdens (https://cci-hit.uncc.edu/resources/rqJan2017TableS2.docx) can be found on our website.

Overall, the correlation coefficients between relative search volume on Google (or relative page views on Wikipedia) and relative disease burden measures are small; all are less than 0.35. We thus assessed the correlations between relative search volume on Google (\( G_{i, } \)) and relative disease burdens (\( P_{i, } \), \( C_{i, } \)) with 0-year, 1-year, and 2-year intervals for individual diseases. Filtering by p < 0.05 on all six correlation coefficients left 60 diseases. Twenty-one diseases whose high correlations were attributable to missing values in either Google Trends or disease burden data were then excluded, and the remaining 39 diseases and their Pearson correlation coefficients are listed in Table 2. Black and white values refer to positive and negative correlations, respectively.

Table 2. 39 diseases demonstrate strong correlations between relative search volume on Google and disease burden measured by relative prevalence and relative treatment cost

In Fig. 2, we also plotted the correlation patterns for four representative diseases. Figure 2(A) shows that viral hepatitis is becoming less and less popular in Google Search, which corresponds to its decreasing prevalence and treatment costs. Figure 2(B) shows that diabetes mellitus is searched less and less frequently on Google, but both its prevalence and treatment cost are increasing with time. This might indicate that as a chronic condition, diabetes mellitus requires long-term treatment but is underestimated by the public. “Other headache syndromes” in Fig. 2(C) exhibits a rising popularity in Google Search, but both its prevalence and treatment cost went down from 2004 to 2014. According to our communication with clinicians, one reasonable explanation is that headache is underdiagnosed as many people do not seek medical consultation for headache. Instead, patients simply turn to the Internet for information. In Fig. 2(D), the relative search volume for multiple sclerosis on Google aligns well with its prevalence but the treatment cost has been rising dramatically, possibly owing to the increase in the cost of medication, which occurred in the same period [23].

Fig. 2. The correlations between relative search volume on Google (solid lines) and relative disease prevalence (dotted lines) and treatment cost (dashed lines), for (A) viral hepatitis, (B) diabetes mellitus, (C) other headache syndromes, and (D) multiple sclerosis.

Finally, we explored whether the relative search volume on Google (\( G_{i, } \)), relative page views on Wikipedia (\( W_{i, } \)), relative disease prevalence (\( P_{i, } \)), relative treatment cost (\( C_{i, } \)), and three other variables we quantified in our previous study [18], namely the relative number of scientific articles from PubMed (\( L_{i, } \)), relative number of clinical trials (\( T_{i, } \)), and relative funding from the NIH (\( F_{i, } \)), for year t could predict the relative disease prevalence (\( \tilde{P}_{i, } \)) or relative treatment cost (\( \tilde{C}_{i, } \)) for year t + 1, using LASSO for each of the 39 diseases we identified in the previous step. Figure 3 shows the LASSO cross-validation curves and variable selection results for the treatment cost prediction of sleep apnea, hemorrhoids, disaccharidase deficiency, and diabetes mellitus. As lambda shrinks (bottom horizontal axis; log scale), the mean-squared error (MSE, left vertical axis) decreases until it reaches its minimum (close to 0 in Fig. 3) at the left vertical line. The right vertical line marks the most regularized model whose error is within one standard error of the minimal MSE. The regression coefficients and intercept of the best-fitting model are listed in each panel. It seems that not all five variables are related to treatment cost prediction in each case, but the relative treatment cost from the previous year is the most useful, which is consistent with our previous findings [18]. We repeated the analysis for relative disease prevalence (\( \tilde{P}_{i, } \)) (https://cci-hit.uncc.edu/resources/rqJan2017FigS1.jpeg). The results confirmed that the predictive power of the aforementioned factors varies from case to case.
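The two vertical lines on such a cross-validation plot can be reproduced from the curve itself. The Python sketch below follows the convention used by glmnet's cross-validation routine, in which the second line marks the largest lambda whose mean error is within one standard error of the minimum; the function name `select_lambdas` and the numbers are illustrative assumptions, not values from Fig. 3.

```python
from typing import List, Tuple


def select_lambdas(
    lambdas: List[float], cv_mean: List[float], cv_se: List[float]
) -> Tuple[float, float]:
    """Return (lambda at minimum error, largest lambda within one standard error)."""
    cv_min = min(cv_mean)
    # If several lambdas tie for the minimum, keep the largest (most regularized).
    i_min = max(
        (i for i, m in enumerate(cv_mean) if m <= cv_min),
        key=lambda i: lambdas[i],
    )
    threshold = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[i_min], lambda_1se


# Made-up cross-validation curve: mean error and its standard error per lambda.
lambdas = [1.0, 0.5, 0.25, 0.125, 0.0625]
cv_mean = [0.90, 0.50, 0.32, 0.30, 0.31]
cv_se = [0.05, 0.04, 0.03, 0.03, 0.03]
lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

Here the minimum error occurs at lambda = 0.125, while lambda = 0.25 is the largest value whose error (0.32) stays below the threshold of 0.33, so it gives a sparser model at little cost in accuracy.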

Fig. 3. LASSO cross validation curves and estimated coefficients of four diseases. (A) Sleep apnea, (B) Hemorrhoids, (C) Disaccharidase deficiency, and (D) Diabetes mellitus.

4 Discussion and Conclusions

In the present study, we investigated the correlation of search volume on Google and page view counts on Wikipedia with disease burden, measured by prevalence and treatment cost, for 1,633 diseases over an 11-year period. Our analysis revealed that disease prevalence is more strongly correlated with search volume on Google and page view counts on Wikipedia than treatment cost is. A relatively stronger correlation exists for 39 of the 1,633 diseases, including viral hepatitis, diabetes mellitus, other headache syndromes, multiple sclerosis, sleep apnea, hemorrhoids, and disaccharidase deficiency. However, the relative search volume on Google and page view counts on Wikipedia for different diseases displayed different correlation patterns with their prevalence and treatment costs. Further, the LASSO regression analysis showed that the relative search volume on Google, relative page views on Wikipedia, relative disease prevalence, relative treatment cost, relative number of scientific articles from PubMed, relative number of clinical trials, and relative funding from the NIH vary in their power to predict future disease burdens. Our analysis is limited to prevalence and treatment cost and does not cover other measures of disease burden, owing to data availability and comparability. The findings also caution us not to over-generalize when estimating disease burdens for the purpose of understanding population health, formulating health policies, or planning resource allocation. Instead, we should consider each individual disease according to its characteristics, such as its acute or chronic nature, severity, familiarity to the public, and presence of stigma.