INTRODUCTION

The National Cancer Institute (NCI) states that the rate of new cases of female breast cancer was 124.9 per 100,000 women per year, while the rate of deaths was 21.2 per 100,000 women per year for 2010–2014.1 An extensive study has examined trends and patterns of disparities in cancer mortality among US counties.2 In 2014, there were an estimated 3,327,552 women living with female breast cancer in the USA. An article on cancer mortality trends for 1980–2014, which studied 29 types of cancer at the county level, found that breast cancer mortality rates declined significantly over that period, which is consistent with the NCI’s findings.3 Multiple studies have identified several risk factors for female breast cancer including screening,1 smoking,4 diabetes,5 obesity,5,6 physical inactivity,7 educational attainment,8 race,8 poverty,9 and environmental factors such as air pollution.10,11,12 Additional risk factors, such as age at menarche, age at menopause, breastfeeding, parity, and percent foreign born, could also be considered if such data can be obtained for all counties in the contiguous USA, but it may be very difficult or impossible to obtain data on all risk factors.13

Heat maps of breast cancer rates are useful for showing geographic differences descriptively, but it is even more useful and important to add a cluster analysis based on modern disease surveillance algorithms, by which groups of adjacent counties with high cancer rates can be identified as “clusters.” Such a cluster analysis also tests the statistical significance of each identified cluster in order to weed out random groupings of counties. Identifying significant cancer clusters allows researchers to focus on geographic areas that need further study to better understand why a certain grouping of adjacent counties displays unusually high cancer incidence, mortality, or both. Not all counties located within a high-rate cluster necessarily have high cancer rates. A cluster analysis with the software SaTScan™ can identify a “disease outbreak” and can warn health departments about such a possibility.14,15,16 A follow-up cluster analysis using only the counties in a large cluster can identify smaller hotspots, if desired.

The main goal of this project is to study any significant spatial and space-time clustering of female breast cancer mortality, and any significant spatial clustering of female breast cancer incidence, between 2000 and 2014 in the contiguous USA. Any possible association of breast cancer clustering with county-level risk factors, including age, smoking, race, physical inactivity, carcinogenic air pollution, education, and urban living, is investigated. Clusters of counties with high breast cancer burdens are identified. Specifically, we aim to answer the following five research questions:

  1. Are there geographical areas where the age-adjusted breast cancer mortality rate is significantly higher than that in the rest of the USA?
  2. Are there geographical areas where the age-adjusted breast cancer incidence rate is significantly higher than that in the rest of the USA?
  3. Is there a significant space-time interaction for breast cancer mortality in the contiguous USA?
  4. Is there a significant association between the breast cancer mortality rate and any of the studied covariates?
  5. Is there a significant association between the breast cancer incidence rate and any of the studied covariates?

DATA AND METHODS

Data for age-adjusted mortality used in this study were downloaded from the Institute for Health Metrics and Evaluation (IHME), which used small area estimation for any missing data.17 No mortality data were missing for any county for the time period 2000–2014. The total sample size was n = 46,620, which is based on 3108 counties for 15 years. The IHME data are based on spatially explicit Bayesian mixed effects regression models for cancer mortality.18 The breast cancer mortality data include ICD-9 codes (174-175.9, 217-217.8, 233.0, 238.3, 610-610.9) and ICD-10 codes (C50-C50.929, D05-D05.92, D24-D24.9, D48.6-D48.62, D49.3, N60-N60.99). The incidence data from the Centers for Disease Control and Prevention (CDC) are based on the SEER Site Recode ICD-O-3 definitions, codes C500-C509. Since breast cancer is not a rare type of cancer, no data were suppressed by the CDC. However, incidence data were missing for around 3% of counties, and we estimated the missing counts from the corresponding state rates so that the cluster analysis would not be much affected by such estimation.
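The imputation step for missing incidence counts can be illustrated with a minimal sketch. The text says only that missing counts were estimated from the corresponding state rates; the exact formula, the function name `impute_county_counts`, the record layout, and all numbers below are assumptions made for illustration, not the authors' procedure.

```python
def impute_county_counts(counties, state_rates):
    """Fill missing county case counts using the state-level rate.

    counties: list of dicts with keys 'fips', 'state', 'population', 'cases'
              ('cases' is None when the count is missing).
    state_rates: dict mapping state -> rate per 100,000 women per year.
    """
    filled = []
    for county in counties:
        cases = county["cases"]
        if cases is None:
            # Assumed rule: expected count = state rate x county population.
            cases = state_rates[county["state"]] * county["population"] / 100_000
        filled.append({**county, "cases": cases})
    return filled


# Hypothetical toy records; the IDs, populations, and rate are invented.
counties = [
    {"fips": "A001", "state": "AA", "population": 50_000, "cases": 60},
    {"fips": "A002", "state": "AA", "population": 20_000, "cases": None},
]
state_rates = {"AA": 120.0}

filled = impute_county_counts(counties, state_rates)

# The mortality sample size quoted in the text: 3108 counties x 15 years.
n_county_years = 3108 * 15
```

Imputing from the state rate gives a county the count expected if it behaved like its state as a whole, so an imputed county neither adds to nor subtracts from any apparent excess.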

The data set that we obtained from IHME had no missing mortality data, but the estimated mortality counts were non-integers. While the normal model in SaTScan™ could be used for any continuous data, counties with small populations would have the same weight as counties with large populations.19 Therefore, a weighted normal model was used with the disease surveillance software SaTScan™. The weighted normal model assigns different weights based on the county population.20 The incidence counts were gathered from the CDC (2017), and were analyzed with a Poisson model in SaTScan™ after age adjusting with the Statistical Analysis System (SAS™), using Poisson Regression in proc glimmix. The GIS-based software ArcMap™ was then used to create the cluster maps.

Data on several covariates (smoking, diabetes, obesity, PM 2.5 air pollution, drug poisoning deaths, inactivity, poverty, urban living, race, and educational attainment) at the county level were obtained from the CDC,21,22,23 National Center for Health Statistics,17 United States Census Bureau,24,25,26 and United States Department of Agriculture.27 Such risk factors are not independent of each other. We therefore reduced the number of covariates based on their association with breast cancer and removed covariates that were confounded with others, such as drug poisoning deaths, obesity, and poverty. The covariates were checked for potential confounding using a correlation analysis in SAS™.
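The correlation screen described above was run in SAS™; the following Python sketch only illustrates the idea. The helper names, the threshold of 0.7, and the covariate values are all hypothetical, since the text does not state which cutoff or correlation measure was used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_correlated_pairs(covariates, threshold=0.7):
    """Return covariate pairs whose |r| meets or exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(covariates.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Invented county-level percentages, for illustration only.
covariates = {
    "poverty": [10.0, 15.0, 20.0, 25.0, 30.0],
    "obesity": [22.0, 27.0, 31.0, 37.0, 41.0],
    "urban": [80.0, 35.0, 60.0, 20.0, 55.0],
}
flagged = flag_correlated_pairs(covariates)
```

In this toy data only the poverty and obesity columns move together closely enough to be flagged, which is the kind of pair from which one member would be dropped before the regression.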

In addition to age adjusting, mortality was adjusted for smoking, PM 2.5 particulate air pollution, physical inactivity, and race using linear regression in SAS™. The residuals obtained from fitting mortality on these covariates are, in effect, the mortality rates adjusted for age, smoking, physical inactivity, PM 2.5 air pollution, and race. Linear regression was used because the IHME mortality estimates are not integers. We attempted to round the county mortality estimates up or down in order to use the Poisson model, but the rounding greatly distorted the resulting heat maps and the cluster analysis. The incidence counts were adjusted for smoking, air pollution, urban living, and education using Poisson regression (proc glimmix) in SAS™.
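To make the idea of covariate-adjusted residuals concrete, here is a small pure-Python sketch of an ordinary least-squares fit followed by residual extraction. It is a stand-in for the SAS™ regression, not a reproduction of it: the function names are hypothetical, only two covariates are used, and the six county values are invented.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols_residuals(y, covariate_rows):
    """Residuals of y after a least-squares fit on an intercept plus covariates."""
    design = [[1.0] + list(row) for row in covariate_rows]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    return [yi - fi for yi, fi in zip(y, fitted)]


# Invented values: mortality per 100,000 and two covariates
# (smoking %, physical inactivity %) for six hypothetical counties.
mortality = [22.1, 24.3, 20.5, 26.0, 23.2, 21.4]
covariates = [
    (18.0, 22.0),
    (24.0, 27.0),
    (15.0, 20.0),
    (27.0, 31.0),
    (21.0, 23.0),
    (17.0, 25.0),
]
residuals = ols_residuals(mortality, covariates)
```

By construction the residuals sum to zero and are uncorrelated with each covariate in the model, which is why a cluster found in the residuals cannot be attributed to those covariates.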

The residuals from the regression are used in SaTScan™ for a purely spatial cluster analysis adjusted for the covariates selected above, and ArcMap™ is then used to create the cluster maps. The residuals represent the information left in the cancer variable after the regression model has removed everything that the covariates can account for. We therefore identify clusters using information that is beyond what the covariates can explain. In the case of cancer incidence, the Poisson regression predicts the counts, and the predicted covariate-adjusted incidence counts are then analyzed with SaTScan™. The SaTScan™ website (www.satscan.org) explains this methodology for the Poisson model. The raw incidence counts are used in the case file, while the predicted counts from the regression analysis are used in the population file. The resulting clusters are based on covariate-adjusted incidence. Using the mortality data, a graph was created with the trend analysis software JOINPOINT™ to check for any changes in the slope of female breast cancer mortality since 2000. The more sophisticated space-time analysis of breast cancer mortality was also done with SaTScan™. Since the CDC averages incidence data over 5 years at a time, we could not do a space-time analysis of the breast cancer incidence data.
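The role of the predicted counts in the Poisson analysis can be sketched as follows. The code evaluates one candidate zone using the log likelihood ratio in the form commonly given for the Poisson spatial scan statistic; that this matches what the software computes internally is an assumption here, and the county IDs, counts, and zone are invented. A real analysis scans many candidate zones and assesses significance by Monte Carlo replication, which this sketch omits.

```python
from math import log


def poisson_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for one candidate high-rate zone.

    expected_in must be scaled so that the expected counts over all
    counties sum to total_cases.
    """
    cases_out = total_cases - cases_in
    expected_out = total_cases - expected_in
    if cases_in <= expected_in:
        return 0.0  # no excess inside the zone, so not a high-rate candidate
    llr = cases_in * log(cases_in / expected_in)
    if cases_out > 0:
        llr += cases_out * log(cases_out / expected_out)
    return llr


# Invented numbers: observed cases (case file) and regression-predicted
# counts (population file) for four hypothetical counties.
observed = {"c1": 130, "c2": 95, "c3": 110, "c4": 65}
predicted = {"c1": 100.0, "c2": 100.0, "c3": 100.0, "c4": 100.0}

total_cases = sum(observed.values())
scale = total_cases / sum(predicted.values())  # expected must sum to observed
expected = {k: v * scale for k, v in predicted.items()}

zone = ["c1", "c3"]  # hypothetical candidate zone of adjacent counties
cases_in = sum(observed[k] for k in zone)
expected_in = sum(expected[k] for k in zone)

# Risk inside the zone relative to risk outside it.
relative_risk = (cases_in / expected_in) / (
    (total_cases - cases_in) / (total_cases - expected_in)
)
llr = poisson_llr(cases_in, expected_in, total_cases)
```

Because the expected counts already reflect the covariates, an excess of observed over expected cases inside a zone is an excess beyond what the covariates predict.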

RESULTS

After running the correlation analysis on the available covariates, four covariates (in addition to age) were chosen for the mortality rate and four for the incidence rate. Figure 1 gives a purely spatial map of age-adjusted breast cancer mortality rates, in which each county is colored according to its age-adjusted rate and the four significant clusters from the SaTScan™ weighted normal model are marked with red circles. The Southeast cluster also shows a clustering around the Mississippi River, starting from New Orleans. The second cluster covers part of the East Coast, plus West Virginia. Table 1 lists some additional information on these spatial clusters.

Figure 1

Purely spatial clusters of age-adjusted breast cancer mortality rate for 2000–2014.

Table 1 Breast Cancer Cluster Details

The purely spatial cluster analysis was followed by a space-time analysis to determine whether there are geographical areas with a space-time interaction in breast cancer mortality. As shown in Table 1, there are two significant space-time clusters that closely resemble the two purely spatial clusters shown in Figure 1. The NE space-time cluster identifies the years 2000–2006 as the time period for the increase in breast cancer mortality in this geographic location. The normalized mortality rate inside this cluster is 0.67 standard deviations above the normalized national average (of zero). The SE space-time cluster was identified for the time period 2008–2014, with a normalized mean of 0.94 inside the cluster.

The covariate-adjusted breast cancer mortality rates are shown in Figure 2. The orange counties were inside a high age-adjusted breast cancer mortality cluster when no other covariates were used, and they remained inside a cluster after adjustment. Such clusters are not associated with any of the covariates used (smoking, air pollution, inactivity, and race), and they persisted inside the original clusters shown in Figure 1. Other factors or covariates not studied here may have led to such high breast cancer mortality rates, and health agencies would need to look deeper into the high cancer rates in these counties. The red counties were inside a high cluster in Figure 1 but dropped out of the cluster after adjusting for the selected covariates. This means that in these red counties, the high age-adjusted breast cancer mortality is associated with the selected covariates. In other words, the cancer rates are “explained” by the covariates. On the other hand, the yellow counties display high cancer rates only after the original rates are adjusted for the selected covariates. In Figure 1, the breast cancer mortality rates of these counties did not appear particularly high, but once adjusted for the covariates, they are elevated. In Figure 2, the three-county cluster in Missouri was explained by the covariates. The Indiana-Illinois cluster and the cluster near the mid-Atlantic coast both shifted, while the southern cluster shrank dramatically.

Figure 2

Purely spatial clusters of covariate-adjusted breast cancer mortality rate for 2000–2014.

As for the incidence of breast cancer, a Poisson regression model is used to predict incidence counts (in SAS™). The resulting predicted counts are used in the SaTScan™ Poisson model. Figure 3 gives a cluster map of breast cancer incidence for the USA. While some incidence clusters are located in similar places to some of the mortality clusters, the incidence map (Fig. 3) looks very different from the corresponding mortality map (Fig. 1). The mortality map has its clusters in the eastern part of the USA, while the incidence map shows clusters along the eastern coast and all along the northern border of the USA. It is useful to compare cluster maps based on breast cancer mortality with the corresponding maps for incidence, since each type of cancer data can be associated with a different population socioeconomic status (SES). Table 1 gives details on the clusters.

Figure 3

Purely spatial clusters of age-adjusted breast cancer incidence rates for 2000–2014.

Figure 4 shows the role of the selected covariates (smoking, educational attainment, urban living, and air pollution) in adjusting the breast cancer incidence rates, based on the results of the SaTScan™ Poisson model applied to the covariate-adjusted incidence data. Only the red-colored counties have high breast cancer incidence rates that can be associated with the chosen covariates. The clusters in the northern Midwest appear to have merged, as have the two clusters near the Pacific Northwest. The two clusters along the eastern coast appear to have shifted and merged as well. The two clusters in Florida were not affected by the chosen covariates. The orange-colored counties have high breast cancer incidence rates that are not associated with the chosen covariates. There may exist some (unknown) factors that are associated with the high cancer incidence in such counties. The yellow counties were not inside any cluster before adjusting for the covariates.

Figure 4

Purely spatial clusters of covariate-adjusted breast cancer incidence rate for 2000–2014.

In addition to the purely spatial analysis, we also analyzed the trend of the mortality rate data to address Research Question 5. Using JOINPOINT™ software and the age-adjusted mortality data, a “joinpoint” is identified in 2008.28 At this point, the rate of decline in the mean normalized mortality rate changed. The breast cancer mortality rate decreased by 0.36 (per 100,000) each year for 2000–2008, but by only 0.05 (per 100,000) each year for 2008–2014.
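The notion of a joinpoint can be illustrated with a simplified sketch: fit a continuous two-segment line for each candidate change year and keep the year with the smallest squared error. This is not the JOINPOINT™ algorithm, which also tests how many joinpoints are warranted. The series below is synthetic, built from the two slopes reported above and a hypothetical starting rate of 25.0 per 100,000, so recovering 2008 here demonstrates the method only and is not a result.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_hinge(t, y, knot):
    """Least-squares fit of y = b0 + b1*t + b2*max(0, t - knot)."""
    design = [[1.0, float(ti), max(0.0, float(ti - knot))] for ti in t]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(3)]
    beta = solve_linear_system(xtx, xty)
    sse = sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, yi in zip(design, y)
    )
    return beta, sse


def best_single_joinpoint(years, rates):
    """Grid search over interior years; returns (year, slope before, slope after)."""
    t = [yr - years[0] for yr in years]
    fits = {k: fit_hinge(t, rates, k) for k in t[1:-1]}
    knot = min(fits, key=lambda k: fits[k][1])
    beta, _ = fits[knot]
    return years[0] + knot, beta[1], beta[1] + beta[2]


# Synthetic series: hypothetical start of 25.0, then the slopes from the text.
years = list(range(2000, 2015))
start_rate = 25.0
rates = [
    start_rate - 0.36 * min(yr - 2000, 8) - 0.05 * max(0, yr - 2008)
    for yr in years
]
joinpoint_year, slope_before, slope_after = best_single_joinpoint(years, rates)
```

The hinge term `max(0, t - knot)` keeps the two segments connected at the joinpoint, so its coefficient is the change in slope rather than a jump in level.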

DISCUSSION

This is the first study of purely spatial breast cancer incidence clusters in the contiguous USA that also includes a purely spatial and a space-time cluster analysis of breast cancer mortality. This study provides an opportunity to contrast mortality clusters with incidence clusters, while carefully taking into account the possibility of several confounded effects when trying to shed light on why some clusters appeared on the map. Using SaTScan™, clusters of breast cancer mortality and incidence are identified for the contiguous USA, showing the role of the chosen covariates in such rates. While Figures 1 and 2 show mortality clusters located in only one part of the USA, incidence clusters are spread over several parts of the contiguous USA (Figs. 3 and 4). Even after adjusting for known covariates and risk factors, several clusters remained. These could be attributed to factors outside the scope of this project and should be investigated further. For example, we did not directly study breast cancer screening. It is well known that mammograms are less frequent in counties with low SES than in counties with high SES, so screening is likely to be confounded with, for example, county-level poverty. Counties with low SES have lower rates of screening for breast cancer, resulting in lower incidence rates and higher mortality rates.29,30 Some other possible risk factors that were not studied here include parity, percent foreign born, age at menopause, and breastfeeding.

As our study focused on the county level, localized effects cannot be found at this geographic resolution. Data for smaller geographic units, such as census tracts, may be more effective for finding impacts at a finer geographic scale if needed. Because the incidence data from the CDC were reported as averages of 5 years of counts, a single year with an unusually high or low rate in a particular county will not be apparent in these data. Further analysis of trends in the mortality rate shows 2008 to be a pivot year in which the decline in the mortality rate slowed. This shift in the mortality rate should be investigated further. The space-time cluster analysis of mortality rates reveals that the NE cluster has an increase in breast cancer mortality rates for the years 2000–2006, while the SE cluster has an increase for the years 2008–2014. Note that the space-time interaction for the SE cluster coincides with the joinpoint in 2008, which is based on national breast cancer mortality rates. The increase in the SE of the USA may have been the main reason why the decline in national breast cancer mortality rates slowed from 2008 onward.

It is worthwhile to look deeper into the differences in the geographic locations of the significant breast cancer mortality clusters versus the corresponding incidence clusters. Why are these clusters so different? Some possible explanations are (i) geographical variation in breast cancer screening: counties with higher SES may have a higher health insurance rate, which results in a higher rate of visits to the physician for mammograms, and such screening will result in higher cancer incidence but not necessarily in higher mortality; (ii) geographical variation in breast cancer type: a higher proportion of more aggressive cancers may lead to clusters of mortality but not necessarily to clusters of incidence; (iii) geographical variation in breast cancer treatment: it is possible that better treatment leads to lower mortality without affecting incidence; (iv) geographical variation in the concentration of women from ethnic groups with a higher probability of having a genetic predisposition for breast cancer; and (v) geographical variation in parity or hormone use.

In answer to our research questions, we conclude that the analysis of female breast cancer incidence and mortality data in this study supports an association with certain covariates. However, certain geographic areas, identified as significant clusters, appear to have incidence and mortality rates beyond what the studied covariates can account for. These geographic areas warrant further investigation to identify additional local concerns or needs in addressing female breast cancer at those specific sites. In general, counties with high smoking rates, a high proportion of Black residents, high physical inactivity, and high PM 2.5 air pollution also have high breast cancer mortality. Counties with high rates of smoking and college education have high incidence rates, but counties with high rates of urban living and PM 2.5 air pollution have low incidence rates.

Limitations of the Study

(i) While all counties in the contiguous USA had data on cancer mortality, 3% of the counties had missing incidence data, which had to be estimated. The accuracy of the data may therefore be a limiting factor. (ii) Not all possible risk factors were used. For example, the data did not include information on stage at diagnosis. (iii) US counties were the smallest geographical units used in this study.