Evaluation of hotspot cluster detection using spatial scan statistic based on exact counting

  • Fumio Ishioka
  • Jun Kawahara
  • Masahiro Mizuta
  • Shin-ichi Minato
  • Koji Kurihara
Original Paper Computational statistics and machine learning


In this paper, we propose a novel approach to the detection of spatial clusters based on linkage information of a map dataset. The spatial scan statistic has been widely used for detecting a hotspot cluster (or a coldspot cluster) in various fields, such as astronomy, biosurveillance, natural disasters, and forestry. This approach is based on the idea of finding a connected regional subset that maximizes the likelihood over the whole study area. To detect a hotspot cluster, that is, an aggregation of high-risk regions with maximum likelihood, we need only search for such a cluster among all patterns of connected regional subsets. However, unless the study area has extremely few regions, the total number of connected regional patterns is usually enormous, and we cannot investigate all of them. This means that we have not been able to know whether a hotspot detected under certain rules, such as those used in previous studies, truly has the maximum likelihood within a given study area. A zero-suppressed binary decision diagram (ZDD), one approach to frequent item set mining, enables us to extract all of the potential cluster regions at a realistic computational load. In this study, we propose a hotspot detection method using ZDD-based enumeration, and apply it to sudden infant death syndrome in North Carolina. This completely new method enables us to detect a true hotspot cluster, that is, one with the truly maximum likelihood. To evaluate our proposed method, we compare its properties with those of existing methods such as the flexible scan and the echelon scan, and discuss their suitability for different purposes of hotspot detection.


Spatial cluster detection; Spatial scan statistic; Echelon analysis; Zero-suppressed binary decision diagram

1 Introduction

Detecting where a problem occurs, as in tracking the occurrence of infectious diseases or mapping the hazards of natural disasters, is basic and important for elucidating the causes of the problem and for taking measures for environmental preservation or safety management. Currently, it is becoming easier to analyze various types of spatial data and express them visually on a map, thanks to dramatic advances in geographical information systems driven by the growing sophistication of hardware and network technologies. For example, a statistical map with shading, such as a choropleth map, can be used to show how quantitative information varies geographically. However, such a map only lets us judge the difference in level between individual regions through equivocal visual information, and it is still difficult to estimate the location of spatial clusters based on statistical evidence.

Several studies dealing with various types of spatial data have been conducted to detect spatial clusters. Besag and Newell (1991) separated cluster detection tests into two categories, focused and general. A focused test detects whether there are clusters around pre-specified point sources, such as nuclear installations and incinerators. A general test targets clusters over the study area. Furthermore, general tests are classified as involving global or local statistics. A global statistic is designed to evaluate whether or not there are spatial clusters in the study area. Examples include Moran’s I statistic (Moran 1948), based on spatial autocorrelation; the Cuzick–Edwards test (Cuzick and Edwards 1990), based on a kind of k-nearest neighbors method; and Tango’s index (Tango 1995), based on a factor of data and a measure of closeness between regions. A local statistic is used to detect the locations of clusters. For example, Anselin (1995) proposed a local Moran’s I statistic that detects cluster regions from the perspective of spatial autocorrelation. Openshaw et al. (1987) and Besag and Newell (1991) attempted cluster detection using subregions based on a predetermined rule. Tango’s index, extended as a local test (Tango 2000), has also been used in the field of spatial epidemiology.

Recently, the spatial scan statistic (Kulldorff 1997) has been widely used for cluster detection together with the freely available SaTScan™ software (Kulldorff 2018) and applied in such fields as astronomy, biosurveillance, natural disasters, and forestry. It is commonly used to evaluate the statistical significance of temporal and geographical clusters without requiring any prior assumptions about their location, time period, or size. In addition, several models of the spatial scan statistic have been proposed depending on the nature of the data, such as the Bernoulli (Kulldorff and Nagarwalla 1995), Poisson (Kulldorff 1997), ordinal (Jung et al. 2007), exponential (Huang et al. 2007), normal (Huang et al. 2009; Kulldorff et al. 2009), and multinomial (Jung et al. 2010) models. These statistical approaches detect a hotspot or a coldspot cluster based on the likelihood ratio (LR) associated with the number of events inside and outside a connected regional subset, called a window. To detect a cluster with a high LR, it is desirable to scan every possible window, calculate its LR, and evaluate the results. However, unless the number of regions in the study area is extremely small, it is very difficult to cover all possible window patterns, because their number is generally huge. Kulldorff and Nagarwalla (1995) proposed using a circular window, but it has been pointed out that a non-circular cluster, such as one shaped by a river or a road, cannot be detected by that means. To capture an arbitrarily shaped cluster, several scanning techniques using non-circular windows have been proposed (Duczmal and Assunção 2004; Kurihara 2004; Patil and Taillie 2004; Tango and Takahashi 2005).

In this paper, we propose a novel scanning method using a zero-suppressed binary decision diagram (ZDD) (Minato 1993) from the perspective of detecting the cluster with the truly maximum LR. The ZDD is an approach to frequent item set mining that enables us to extract all possible window patterns at a realistic computational load. By applying this technique to cluster detection, we can exactly determine a true hotspot or coldspot cluster, that is, a spatial cluster with the truly maximum LR. Specifically, when we limit the maximum number K of regions in a cluster, we can prove that the obtained cluster has the truly maximum LR among windows of size at most K, and thus, we can say with confidence that it is futile to search for a region (of size at most K) with a higher LR. As we will discuss in Sect. 5, compared to the existing methods, our method is suitable for finding a spatial cluster of small size but high LR.

Section 2 defines a hotspot cluster and introduces the spatial scan statistic and several valid existing scanning methods. Section 3 describes the ZDD. In Sect. 4, we demonstrate the detection of a true hotspot cluster for simple artificial data. In addition, we apply the ZDD technique to real data, that is, we strictly calculate the number of connected regional patterns and detect a true hotspot cluster for the North Carolina sudden infant death syndrome (SIDS) data. Furthermore, we compare it with hotspots obtained using existing scanning methods and discuss their suitability for different purposes of hotspot detection. In Sect. 5, we evaluate our proposed method by clarifying the properties of each scanning method and conclude the paper with a discussion of our work.

2 Definition of hotspot cluster and methods for detecting hotspot

2.1 Test statistic

The spatial scan statistic is a popular method used in disease surveillance for the detection of disease clusters. Let us assume that a study area is divided into m regions. The idea of this method is to find a connected regional subset, called a window \({\mathbf{Z}}\), which can be a candidate for a spatial cluster. The cluster is identified based on an LR between null and alternative likelihoods calculated with an appropriate probability model. In this study, we choose the traditional spatial scan statistic for the Poisson model consisting of the observed and expected number of cases. The hypotheses for the presence of a hotspot cluster, that is, a cluster with a higher observed number of cases than expected, are expressed as
$$\begin{aligned}&H_0 : \mathrm{Expectation\phantom {0}of\phantom {0}} O({\mathbf{Z}}) = E({\mathbf{Z}}) \\&H_1 : \mathrm{Expectation\phantom {0}of\phantom {0}} O({\mathbf{Z}}) > E({\mathbf{Z}}), \end{aligned}$$
where O() and E() denote a random number of cases and the expected number of cases, respectively, within the specified window \({\mathbf{Z}}\). (For a coldspot detection, it suffices to reverse the inequality sign of \(H_1\).) In other words, under the null hypothesis of no clusters in the study area, an observed number in region i can be stated as
$$\begin{aligned} O_i \sim \mathrm{Poisson}(E_i),\phantom {00}i=1,2,\ldots ,m\ \end{aligned}$$
where \(O_i\) and \(O_j\) are independent (\(i \ne j\)). For the location and size of each scanning window, the LR statistic is calculated by
$$\begin{aligned} \mathrm{LR}({\mathbf{Z}}) = \left( \frac{o({\mathbf{Z}})}{E({\mathbf{Z}})} \right) ^{o({\mathbf{Z}})} \left( \frac{o({\mathbf{Z}}^c)}{E({\mathbf{Z}}^c)} \right) ^{o({\mathbf{Z}}^c)} I(o({\mathbf{Z}})>E({\mathbf{Z}})), \end{aligned}$$
where o() denotes the observed number of cases in the specified window \({\mathbf{Z}}\), and \({\mathbf{Z}}^c\) indicates all of the regions outside the window \({\mathbf{Z}}\). I() is the indicator function. For computational simplicity, the logarithm of LR (LLR) is typically used instead of the ratio itself. In this article, a hotspot cluster is defined as a window \({\mathbf{Z}}\) with the maximum LLR. One of the advantages of this cluster detection approach is that it does not require any prior assumptions about cluster location or size, because it scans the entire area using a variable window.
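As a minimal sketch of the LLR calculation above, the following Python function evaluates the logarithm of the Poisson LR for a single window from the observed and expected counts inside the window and in the whole study area. The function name `llr` and its arguments are hypothetical. Returning 0 when the indicator \(I(o({\mathbf{Z}})>E({\mathbf{Z}}))\) is zero is an assumed convention (strictly, the logarithm of a zero LR is minus infinity), and the sketch assumes that the total expected count equals the total observed count.

```python
import math

def llr(o_in, e_in, o_total, e_total):
    """Log likelihood ratio of a window under the Poisson model (hotspot case)."""
    if o_in <= e_in:
        # Indicator I(o(Z) > E(Z)) is zero; 0.0 is an assumed convention here.
        return 0.0
    o_out = o_total - o_in
    e_out = e_total - e_in

    def term(o, e):
        # o * log(o / e), using the convention 0 * log(0) = 0
        return o * math.log(o / e) if o > 0 else 0.0

    return term(o_in, e_in) + term(o_out, e_out)

# Hypothetical window: 30 observed against 20 expected, out of 100 cases in total.
print(round(llr(30, 20, 100, 100), 3))
```

Because only the totals inside the window are needed, the same function can be reused by any scanning method that proposes candidate windows.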

What matters here is how effectively and efficiently we can scan for and find the window \({\mathbf{Z}}\) whose LLR is maximum in the entire study area. It is usually impossible to investigate all possible window patterns exhaustively, because their number increases explosively when the study area has a large number of regions. Kulldorff and Nagarwalla (1995) first proposed imposing a circular scanning window on the study area. The center of the circle moves over the centroid of each region, and for each centroid, the radius of the circle varies from zero to a user-defined upper limit. (By default, the window never includes more than 50% of the total population.) Their method is available in the SaTScan™ software (Kulldorff 2018), and it is widely used for hotspot detection in various fields. However, since this technique scans with a circular window, it has difficulty in correctly detecting non-circular hotspots, such as those shaped by a river or a road. To detect arbitrarily shaped hotspots, several non-circular scanning techniques have been proposed. In this paper, we focus on the flexible scan (Tango and Takahashi 2005), whose software is freely available, and the echelon scan, which we have proposed.
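The circular scanning procedure described above can be illustrated roughly as follows. This is a simplified sketch and not the SaTScan™ implementation: the function name `circular_windows`, the representation of regions by centroid coordinates, and the handling of equidistant regions (added one at a time rather than together) are all assumptions made for illustration.

```python
import math

def circular_windows(centroids, populations, max_pop_fraction=0.5):
    """Yield candidate windows as frozensets of region indices.

    For each centroid, regions are added in order of distance from it,
    stopping before the window exceeds the population cap.
    """
    total_pop = sum(populations)
    seen = set()
    for cx, cy in centroids:
        order = sorted(
            range(len(centroids)),
            key=lambda j: math.hypot(centroids[j][0] - cx, centroids[j][1] - cy),
        )
        window, pop = [], 0
        for j in order:
            pop += populations[j]
            if pop > max_pop_fraction * total_pop:
                break
            window.append(j)
            w = frozenset(window)
            if w not in seen:      # report each distinct window once
                seen.add(w)
                yield w

# Hypothetical study area: four regions on a line with equal populations.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
windows = list(circular_windows(points, [1, 1, 1, 1]))
print(len(windows))
```

The number of candidate windows grows at most quadratically with the number of regions, which is why this scan is fast but limited to near-circular shapes.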

2.2 The flexible scan and the restricted LR

The flexible scan (Tango and Takahashi 2005) imposes an irregularly shaped window \({\mathbf{Z}}\) on the study area. To do this, we first need to decide on a maximum number K of regions to be included in the hotspot. For any given region i, we create the collection of irregularly shaped windows, each consisting of k connected regions including region i, where k ranges from one to the pre-defined maximum K and regions are considered in order of distance from region i. The scanned windows are therefore all connected subsets formed from region i and its “\((K-1)\)-nearest” neighbors. A hotspot in this method is defined as the window \({\mathbf{Z}}\) with the highest LLR in the collection mentioned above. As is clear from the procedure, this approach requires both location and neighbor information. However, this method has a practical limit of \(K=30\) for searching clusters, because its algorithm is based on a particular conditional search for all possible window patterns, which requires an unrealistic computational load if m is large. To solve this problem, Tango (2008) and Tango and Takahashi (2012) proposed another spatial scan statistic that restricts the scanned regions according to the risk of each region. Under the Poisson assumption, the restricted LR for a specific window \({\mathbf{Z}}\) is calculated by
$$\begin{aligned} \mathrm{LR}({\mathbf{Z}}) = \left( \frac{o({\mathbf{Z}})}{E({\mathbf{Z}})} \right) ^{o({\mathbf{Z}})} \left( \frac{o({\mathbf{Z}}^c)}{E({\mathbf{Z}}^c)} \right) ^{o({\mathbf{Z}}^c)} I(o({\mathbf{Z}})>E({\mathbf{Z}})) \prod _{i \in {\mathbf{Z}}}I(p_i<\alpha _1), \end{aligned}$$
where \(p_i\) is the one-tailed p value of the test for \(H_0\) and is given by the middle p value:
$$\begin{aligned} p_i = \Pr \{ O_i \ge o_i +1 | O_i \sim \mathrm{Poisson}(E_i) \} +\frac{1}{2} \Pr \{ O_i = o_i | O_i \sim \mathrm{Poisson}(E_i)\} \end{aligned}$$
and \(\alpha _1\) is the pre-specified significance level for the individual region. This restricted LR takes each individual region’s risk rate into account, and thereby enables us to scan only the regions with primarily elevated risk. Tango and Takahashi (2012) reported that this new test statistic has better properties in that the running time is quite fast, and it eliminates the constraint of less than \(K=30\) for searching the cluster candidates. Software for the flexible scan and the flexible scan with restriction is provided by FleXScan (Takahashi et al. 2010).
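The middle p value and the resulting screening of regions can be sketched as below. This is an illustrative sketch only and is not the FleXScan implementation; the function names `poisson_pmf`, `mid_p_value`, and `eligible_regions` are hypothetical, and the expected counts are assumed to be strictly positive.

```python
import math

def poisson_pmf(k, mean):
    """Pr{O = k} for O ~ Poisson(mean), computed on the log scale."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def mid_p_value(o_i, e_i):
    """Pr{O >= o_i + 1} + 0.5 * Pr{O = o_i} for O ~ Poisson(e_i)."""
    below = sum(poisson_pmf(k, e_i) for k in range(o_i))  # Pr{O <= o_i - 1}
    at = poisson_pmf(o_i, e_i)
    return 1.0 - below - 0.5 * at

def eligible_regions(observed, expected, alpha1):
    """Indices of regions whose middle p value is below alpha1."""
    return [
        i for i, (o, e) in enumerate(zip(observed, expected))
        if mid_p_value(o, e) < alpha1
    ]

# Hypothetical data: two regions with the same expected count.
print(eligible_regions([0, 3], [1.0, 1.0], alpha1=0.1))
```

Only windows built entirely from the returned regions would receive a nonzero restricted LR, which is what shrinks the search space.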

2.3 The echelon scan

As another scanning method, we introduce echelon scan (Ishioka et al. 2007; Ishioka and Kurihara 2012). This method searches for a hotspot by moving the scanning window in a particular manner derived from echelon analysis. Echelon analysis (Myers et al. 1997; Kurihara 2004) divides the study area into structural entities consisting of peaks or foundations. As an example, suppose that the data have ten regions labeled from (A) to (J), as shown in Fig. 1(left), and that each region holds its own value. In addition, suppose that they are in a neighboring situation with some other regions, as shown in Table 1.
Fig. 1

Sample of regional data (left) and peaks (right)

Table 1

Neighbor information and value for each region

Region name    Neighboring regions
A              B, D, E
B              A, C, E
C              B, E, F
D              A, E, G, H
E              A, B, C, D, F, H, I, J
F              C, E, J
G              D, H
H              D, E, G, I
I              E, F, H, J
J              E, F, I

For such regional data, we can systematically describe a peak construction. Since the region with the maximum value in the data is D (\(=10\)), {D} is regarded as the first peak. Next, since the region with the maximum value among the neighboring regions of {D} is G (\(=8\)), {G} is included in the first peak. In addition, the region with the maximum value among the neighboring regions of {D, G} is H (\(=4\)). However, H is not greater than I (\(=9\)), which is a neighboring region of H. Consequently, the first peak consists of {D, G} only. Using the same procedure, we can find three further peaks, {B}, {F}, and {I} (see Fig. 1(right)). Furthermore, each region other than the peaks can be regarded as a foundation for the peaks. In this sample data, the region with the maximum value other than the four peaks is J (\(=6\)). Here, {J} neighbors the regions {I} and {F}, which are already assigned as peaks, i.e., an upper level structure. Thus, {J} can be regarded as the foundation of the peaks {F} and {I}, and we can group {F, I, J} together. Similarly, {H} becomes the foundation of {D, G} and {F, I, J}, and furthermore, {E} becomes the foundation of {B} and {D, F, G, H, I, J}. Accordingly, we can represent the echelon dendrogram for this sample data as in Fig. 2.
Fig. 2

Echelon dendrogram for the sample data

As with the flexible scan, we first define a maximum number K of regions to be included in the hotspot. Until the number of regions included in the window \({\mathbf{Z}}\) reaches the pre-defined number K, we let the scanning window move from the upper to the bottom structure of the dendrogram while incorporating the regions in the dendrogram into the window \({\mathbf{Z}}\). For example, in this sample data, if we choose 50% of the total regions as K, i.e., \(K=5\), then we obtain the six window patterns {D}, {D, G}, {I}, {F}, {I, F, J} and {B}. In the collection of scanned windows (needless to say, each window is a connected regional subset), the window \({\mathbf{Z}}\) with the highest LLR is regarded as the hotspot.

3 ZDDs

In this section, we explain ZDDs and how to represent a huge number of windows using them. ZDDs were proposed by Minato (1993) as a compact data structure for representing a family of sets, and they have been used in such research fields as logic synthesis, symbolic model checking, and itemset mining. Recently, ZDDs have been applied to graph optimization problems, such as minimizing the loss for grid networks (Inoue et al. 2014), the longest path problem (Kawahara et al. 2017b), evacuation planning for disasters (Takizawa et al. 2013), and designing electoral systems (Kawahara et al. 2017c). The key idea of using ZDDs for graph optimization is to directly construct the ZDD representing all of the solutions of the problem and extract the optimal solution from the constructed ZDD. The method of ZDD construction is called frontier-based search (Sekine et al. 1995; Kawahara et al. 2017a). In what follows, we describe how to obtain and handle a huge number of windows using ZDDs and frontier-based search.

We formulate the enumeration of windows as a graph problem. We regard the study area as a graph in which each vertex corresponds to a county, and an edge joining two vertices indicates that the two corresponding counties are adjacent (see Fig. 3a, b). In graph theory, a window \({\mathbf{Z}}\) is considered as an induced connected component. Our goal is to enumerate induced connected components on a given graph. For example, there are 26 induced connected components on the graph in Fig. 3b, as shown in Fig. 3c.
Fig. 3

a Example of an area divided into five regions. b Corresponding graph representing the region. c Induced connected components on the graph
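To make the notion of enumerating induced connected components concrete, the following brute-force sketch lists every connected vertex subset of a small graph by repeatedly growing subsets one neighboring vertex at a time. It is not the ZDD-based method of this paper and is practical only for small graphs. The function name `connected_subsets` and the example graph (a four-region area whose adjacency forms a cycle, not the graph of Fig. 3b) are assumptions for illustration.

```python
from collections import Counter

def connected_subsets(adjacency, max_size):
    """Return all connected vertex subsets with at most max_size vertices."""
    current = {frozenset([v]) for v in adjacency}
    found = set(current)
    for _ in range(max_size - 1):
        grown = set()
        for subset in current:
            # Vertices adjacent to the subset but not yet inside it
            boundary = set().union(*(adjacency[v] for v in subset)) - subset
            for v in boundary:
                grown.add(subset | {v})   # the set removes duplicates
        found |= grown
        current = grown
    return found

# Hypothetical graph: four regions whose adjacency forms a cycle 1-2-3-4-1.
adjacency = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
windows = connected_subsets(adjacency, max_size=4)
print(len(windows), dict(Counter(len(w) for w in windows)))
```

Every connected subset of size k+1 contains a connected subset of size k obtained by removing a suitable vertex, so growing subsets in this way misses no window.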

We introduce ZDDs and explain how to interpret them. For example, we represent a family \(\mathcal {F} = \{\{x_1, x_2, x_3\},{} \{x_1, x_2, x_4\}, {} \{x_1, x_3, x_4\},{} \{x_2, x_4\},{} \{x_3, x_4\}\}\) as the complete binary tree shown in Fig. 4a. The binary tree expresses the case division of whether each element \(x_i\) is used. The solid and dotted arcs of a node with label \(x_i\) indicate that \(x_i\) is used or not used, respectively. Each leaf of the tree has a value of zero or one. The value one indicates that the corresponding set is included in the family represented by the tree. A path from the root node (located at the top) to the leaf with value one corresponds to a set in the family. Figure 4c shows all five paths on the complete binary tree and the corresponding sets. We can say that the complete tree contains all of the information regarding the family \(\mathcal {F}\). We can compress the complete binary tree without losing information, as shown in Fig. 4b. In the same manner as in the case of the complete binary tree, a path from the root to a leaf with value one corresponds to a set in the family, as shown in Fig. 4c. We call the directed acyclic graph in Fig. 4b a ZDD. The exact definition of ZDDs and how to compress the complete binary tree as a ZDD are described in, e.g., Knuth (2011).
Fig. 4

a Complete binary tree representing \(\mathcal {F}\). b ZDD representing \(\mathcal {F}\). c Paths from the root node to the leaf with value one on the ZDD
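A minimal sketch of how such a compressed diagram can be built for the family \(\mathcal {F}\) is given below, assuming the item order \(x_1, x_2, x_3, x_4\). The function names `build_zdd` and `count_sets` are hypothetical, and the simple recursive construction is for illustration only (it is neither frontier-based search nor an optimized ZDD library). Two reduction rules are applied: a node whose "used" arc leads to the zero terminal is skipped, and identical nodes are shared through a lookup table.

```python
ZERO, ONE = 0, 1   # ids of the two terminal nodes

def build_zdd(family, n_items):
    """Build a reduced ZDD for a family of sets over items 1..n_items.

    Returns (root, table), where table[node] = (item, lo, hi) for each
    non-terminal node; lo is the 'not used' child, hi the 'used' child.
    """
    table = [None, None]   # slots 0 and 1 are reserved for the terminals
    unique = {}            # (item, lo, hi) -> node id, for node sharing

    def make(item, lo, hi):
        if hi == ZERO:     # zero-suppression rule: skip this node
            return lo
        key = (item, lo, hi)
        if key not in unique:
            unique[key] = len(table)
            table.append(key)
        return unique[key]

    def build(fam, item):
        if not fam:
            return ZERO    # empty family
        if item > n_items:
            return ONE     # only the empty set remains
        hi = frozenset(s - {item} for s in fam if item in s)
        lo = frozenset(s for s in fam if item not in s)
        return make(item, build(lo, item + 1), build(hi, item + 1))

    root = build(frozenset(frozenset(s) for s in family), 1)
    return root, table

def count_sets(root, table):
    """Count root-to-ONE paths by dynamic programming over shared nodes."""
    memo = {ZERO: 0, ONE: 1}

    def count(node):
        if node not in memo:
            _, lo, hi = table[node]
            memo[node] = count(lo) + count(hi)
        return memo[node]

    return count(root)

family = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 4}, {3, 4}]
root, table = build_zdd(family, 4)
print(count_sets(root, table))
```

The counting step visits each shared node once, so its cost depends on the size of the diagram rather than on the number of sets it represents.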

Next, we explain how to construct the ZDD representing a huge number of induced connected components. Fixing an input graph, we can identify an induced connected component with a set of edges and identify a set of components with a family of edge sets. Therefore, we can represent it as a ZDD. Figure 5 shows the ZDD representing the 26 induced connected components on the graph in Fig. 3b.
Fig. 5

Representation of induced connected components as edge sets and the ZDD containing them

Frontier-based search is a method for directly constructing a ZDD representing subgraphs we would like to obtain when an input graph is given. It can treat graph structures such as paths (Knuth 2011), trees (Sekine et al. 1995), matchings (Kawahara et al. 2017a), and graph partitions (Kawahara et al. 2017b). We can impose various conditions on obtained subgraphs, such as the number of edges, the connectivity of specified vertices, the existence (or non-existence) of a cycle, and the degrees of vertices. A detailed explanation of frontier-based search is provided in Kawahara et al. (2017a). The number of subgraphs that frontier-based search can treat is huge. For example, Kawahara et al. (2017a) reported that the method succeeded in constructing the ZDD representing \(8.32 \times 10^{33}\) spanning trees on a \(9 \times 9\) grid graph with 81 vertices in 67.1 seconds, while an existing algorithm that outputs spanning trees one by one did not finish for a \(6 \times 6\) grid graph with 36 vertices in 1000 s.

Here, we provide a brief overview of frontier-based search. The method constructs a ZDD in a top–down and breadth-first manner. For example, since the children of the nodes (a) and (b) in Fig. 6 are the same, we can merge the two nodes. Storing information in ZDD nodes enables us to decide whether the nodes can be merged before constructing their children, avoiding the duplication of computation.
Fig. 6

Merging two nodes of a ZDD. If a and b are not merged, exactly the same children are constructed for both. The frontier-based search detects such nodes without creating their children and merges them

The frontier-based search for our problem, that is, enumerating induced connected components, is very similar to that used for graph partitioning in Kawahara et al. (2017c). That search obtains graph partitions by dividing an input graph into a specified number of connected components. We show the difference between graph partitioning and an induced connected component in Fig. 7. For graph partitioning, we create the specified number of connected components, while for an induced connected component, the number of connected components is one. By slightly modifying the frontier-based search for graph partitioning, we can obtain an algorithm for constructing a ZDD for induced connected components. Once we obtain the ZDD, it is easy to count the exact number of connected components represented by the ZDD using a dynamic programming-based algorithm (Knuth 2011). To obtain the maximum LLR, we need to extract connected components one by one and compute LLR for them. It is easy to extract connected components from the ZDD, but the computational time is proportional to the number of connected components.
Fig. 7

Difference between graph partitioning and an induced connected component. Graph partitioning is allowed to have two or more connected components
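The final step, extracting every window and keeping the one with the maximum LLR, can be sketched as follows. Here a plain breadth-first enumeration of connected subsets stands in for extraction from the ZDD, so the sketch is feasible only for small maps. The function names `rook_adjacency` and `exhaustive_scan`, the lattice, and the data are hypothetical, and the total expected count is assumed to equal the total observed count.

```python
import math

def rook_adjacency(n_rows, n_cols):
    """Adjacency of lattice cells that share a side."""
    adj = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cand = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
            adj[(r, c)] = {(a, b) for a, b in cand
                           if 0 <= a < n_rows and 0 <= b < n_cols}
    return adj

def exhaustive_scan(adj, observed, expected, max_size):
    """Return the connected window (at most max_size regions) with maximum LLR."""
    o_tot, e_tot = sum(observed.values()), sum(expected.values())

    def llr(window):
        o_in = sum(observed[v] for v in window)
        e_in = sum(expected[v] for v in window)
        if o_in <= e_in:
            return 0.0   # assumed convention when the indicator is zero
        o_out, e_out = o_tot - o_in, e_tot - e_in
        outside = o_out * math.log(o_out / e_out) if o_out > 0 else 0.0
        return o_in * math.log(o_in / e_in) + outside

    current = {frozenset([v]) for v in adj}
    best = max(current, key=llr)
    while current and len(next(iter(current))) < max_size:
        # Grow every window by one adjacent region; the set removes duplicates.
        current = {w | {v} for w in current for u in w for v in adj[u] - w}
        if current:
            best = max([best, *current], key=llr)
    return best, llr(best)

# Hypothetical 2 x 3 lattice with two adjacent high-count cells.
adj = rook_adjacency(2, 3)
observed = {v: 10 for v in adj}
observed[(0, 0)] = observed[(0, 1)] = 30
expected = {v: sum(observed.values()) / len(adj) for v in adj}
best, score = exhaustive_scan(adj, observed, expected, max_size=3)
print(sorted(best), round(score, 2))
```

Because every connected window up to the size limit is evaluated, the returned window is the exact maximizer within that limit, which mirrors the guarantee described above at a much smaller scale.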

4 Illustrative example

4.1 Numerical example

To show an example in which different hotspots are detected depending on the selected scanning method, we consider artificial data consisting of 30 lattice regions \((m=30)\) labeled from (A) to (Z) and (a) to (d), as shown in Fig. 8(left). Let each region \(i \, (i=1,2, \ldots , 30)\) have two variables, that is, the observed number of cases \(o_i\) and the population \(n_i\). For simplicity, we assume \(n_i = 10,000\) for all \(i = 1, 2, \ldots , 30\). As neighbor information, we give “Rook contiguity” to each region, i.e., only regions sharing a side are neighbors of each other. (Regions sharing only a vertex are not neighbors.) We show the results of hotspot detection for this data using four scanning methods: the ZDD-based scan, the flexible scan, the flexible scan with restriction, and the echelon scan. Here, we set \(K = 15\), that is, up to 50% of the total number of regions. To calculate the LLR for each window \({\mathbf{Z}}\), we calculated the expected number of cases in region i as
$$\begin{aligned} E_i = n_i\times \frac{\sum {o_i}}{\sum {n_i}},\phantom {00}i=1,2,\ldots ,m. \end{aligned}$$
In this data, \(E_i = 66.67\) for all \(i = 1, 2,\ldots , 30\).
Fig. 8

Artificial data consisting of 30 lattice regions (left) and its true hotspot (right). The values in each region show the observed number of cases
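The expected number of cases defined above is simply each region's population multiplied by the overall rate. A minimal sketch follows; the function name `expected_cases` and the observed counts are hypothetical, chosen only so that they sum to 2000 cases over 30 regions, which is roughly the total implied by \(E_i = 66.67\) with \(n_i = 10,000\).

```python
def expected_cases(observed, population):
    """E_i = n_i * (sum of observed cases) / (sum of populations)."""
    overall_rate = sum(observed) / sum(population)
    return [n * overall_rate for n in population]

# Hypothetical counts for 30 regions (not the values of Fig. 8).
observed = [100] * 10 + [50] * 20
population = [10_000] * 30
expected = expected_cases(observed, population)
print(round(expected[0], 2))
```

By construction the expected counts sum to the total number of observed cases, which is the property the LLR calculation relies on.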

In this study, a window \({\mathbf{Z}}\) whose LLR is the maximum over every possible window is defined as a true hotspot cluster, and such a \({\mathbf{Z}}\) is exactly determined by the ZDD-based scan. Under \(K=15\), we detected the true hotspot cluster consisting of the 13 regions {A, C, E, F, G, H, I, J, M, T, X, Y, b} shown in Fig. 8(right). The results of the existing scanning methods under \(K=15\) are shown in Fig. 9. The hotspot cluster detected using the flexible scan (Fig. 9(left)) has the highest LLR among all window patterns restricted to “a region located at the center of the scanned window and the 14 regions located closest to the center region”. When applying the flexible scan with the restriction \(\alpha _1=0.20\) to this data, the regions {A}, {C}, {G}, {H}, {J}, {X}, {Y}, {b}, which satisfy \(o_i \ge 72\) as derived from equation (3), are targeted for scanning, and as a result, the hotspot shown in Fig. 9(center) was detected. Figures 9(right) and 10 show the hotspot detected by the echelon scan and the echelon dendrogram on which the detection is based, respectively. In the echelon scan, the LLR became maximum when we scanned up to three peaks ({H, G}, {J, K} and {A}), which contain clearly high observations, and their foundations ({C} and {I}), whose observed values are relatively low. The results for each detected hotspot are summarized in Table 2. Unlike the existing methods, the notable feature of the ZDD-based scan is that there is no constraint on the scanning process, which makes it possible to detect a hotspot with the truly maximum likelihood. On the other hand, it should be noted that a ZDD-based hotspot may contain regions whose observations are smaller than the expected value.
Fig. 9

Detected hotspots using existing scanning methods with \(K=15\): flexible scan (left), flexible scan with restriction (center) and echelon scan (right)

Fig. 10

Echelon dendrogram for the artificial data and the detected hotspot based on echelon scan

Table 2

Results of detected hotspot cluster and its LLR for the artificial data using each scanning method with \(K=15\)

Scanning method                   Window \({\mathbf{Z}}\) identified as hotspot    \(\mathrm{LLR} \,({\mathbf{Z}})\)
ZDD-based scan                    A, C, E, F, G, H, I, J, M, T, X, Y, b
Flexible scan                     G, H, O, U, V, X, Y
Flexible scan with restriction    A, C, G, H
Echelon scan                      A, C, G, H, I, J, K


4.2 Application to real data

4.2.1 SIDS data in North Carolina

Using the ZDD-based scan, we detect a true hotspot cluster for the well-known spatial data set, i.e., Sudden Infant Death Syndrome (SIDS) in North Carolina. Sudden infant death is defined as the sudden unexplained death of a child less than one year of age. The data consist of the number of SIDS cases and the number of live births from 1974 to 1984 for each of 100 counties of North Carolina. There were 1503 deaths and 752,354 live births in total during that period. The overall incidence rate is 2.00 per 1000 live births. The geographical distribution of the SIDS rates for each county is shown in Fig. 11. As a typical data set, SIDS data have been used for evaluations of various approaches to spatial data analysis, such as cluster detection, spatial mixture modeling, and Bayesian or kriging mapping (Cressie and Chan 1989; Cressie 1992; Kulldorff 1997; Lawson and Clark 2002; Berke 2004).
Fig. 11

County numbers for each of the 100 counties of North Carolina and the geographical distribution of SIDS rates for each county

As an introduction to hotspot detection for North Carolina’s SIDS data, we describe the application reported in Kulldorff (1997). He continuously varied the radius of the circular scanning window from zero to a maximum radius such that the window never included more than 50% of the total live births. We replicated his work using the SaTScan™ software, and the detected hotspot consisted of the counties {9, 24, 47, 78, 83} in the southern part of the state with \(\mathrm{LLR}=25.38\). This is shown in Fig. 12a. Here, we calculated the number of expected SIDS cases in county i as in Eq. (4), where \(n_i\) is the number of live births in county \(i \, (i=1,2,\ldots ,100)\). The advantage of Kulldorff’s circular scanning method is that the power is high for a circular hotspot, such as when a certain infectious disease spreads concentrically, and the calculation load is low due to the simplicity of the algorithm. However, since the shape of the scanning window is restricted to a circle, a detected hotspot cannot always be said to have the truly maximum LLR compared to other hotspots of arbitrary shape.

4.2.2 True hotspot for SIDS in North Carolina

Enumerating all possible window patterns for North Carolina’s 100 counties enables us to detect the true hotspot cluster of SIDS as defined above, and we can achieve that using the ZDD technique. Before detecting the true hotspot, let us first draw attention to the enumeration itself. Very few attempts have been made to count every potential window directly, but the proposed algorithm makes it clear that there are a total of 457,360,042,704,181,970,785,600,288 (of the order of \(10^{26}\)) patterns for North Carolina. The execution time and the memory utilization for constructing the ZDD were 74.19 s and 540 MB, respectively. The breakdown of the number of connected regional patterns (window patterns) consisting of k counties is listed in Table 3. This table is the first result that strictly calculates and clarifies the number of connected regional patterns in North Carolina’s 100 counties. Unfortunately, although the number of all patterns can be calculated by the ZDD as Table 3 shows, finding the specific regional pattern with the maximum LR requires extracting patterns from the ZDD one by one, which is very difficult with current computer performance when the number of regions constituting the window is large. For North Carolina’s 100 counties, we were able to enumerate the actual regional patterns consisting of up to 11 counties and calculate the LLR for all of them.
Table 3

Number of connected regional patterns consisting of k counties for North Carolina 100 counties


Columns: k, # of connected regional patterns

To detect the true hotspot cluster, we have to specify a maximum number K of counties to be included in it. In this study, we choose seven different values, \(K=5, 6, 7, 8, 9, 10\) and 11, and the results are summarized in Table 4. In the case of \(K=5\) (see Table 4a), the true hotspot consisted of the five counties {9, 24, 47, 78, 83} with \(\mathrm{LLR}=25.38\), the same as the result of Kulldorff’s circular scan. When K was increased by one, county {9} dropped out and two counties {4, 77} joined, so that six counties in total were detected as the true hotspot with \(\mathrm{LLR}=26.79\) (Table 4b). When K was increased by one more, county {62} was newly added, and as a result seven counties were detected as the true hotspot with \(\mathrm{LLR}=29.00\) (Table 4c). In the case of \(K=8\) or \(K=9\), the detected true hotspot was the same, consisting of the eight counties {4, 9, 24, 47, 62, 77, 78, 83} with \(\mathrm{LLR}=30.81\) (Table 4d). In the case of \(K=10\), the detected true hotspot consisted of ten counties with \(\mathrm{LLR}=31.90\) (Table 4e), and unlike the previous trends, it lay in the north and south of North Carolina. Furthermore, in the case of \(K=11\), county {54} was added to the result of \(K=10\), and the true hotspot with \(\mathrm{LLR} = 34.54\) was detected (Table 4f). On the other hand, the execution time was about 1,237,800 s (about 14 days) when \(K = 11\), using a Windows 7 PC with an Intel(R) Core(TM) i7 X990 CPU (3.47 GHz) and 24 GB of memory. The maps of the detected true hotspots can be seen in Fig. 12a–f, respectively.
Fig. 12

Detected hotspots of SIDS in North Carolina. The county numbers for detected hotspot are indicated. a True hotspot (\(K=5\)), circular scan (Kulldorff 1997), flexible scan (\(K=5, 6, 7, 8, 9\)), flexible scan with restriction (\(K=5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50\)) and echelon scan (\(K=5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20\)); b true hotspot (\(K=6\)); c true hotspot (\(K=7\)); d true hotspot (\(K=8, 9\)) and flexible scan (\(K=10, 11, 12, 13, 14, 15, 20\)); e true hotspot (\(K=10\)); f true hotspot (\(K=11\)); g flexible scan (\(K=30\)); h echelon scan (\(K=30\)); i flexible scan (\(K=40\)); j echelon scan (\(K=40\)); k echelon scan (\(K=50\))

Table 4

Details of the detected hotspots

Rows: Hotspot (a) to Hotspot (k); columns include # of counties, o, n, E, RR and LLR

The Roman letters corresponding to Fig. 12 are allocated to each of them. o: observed SIDS cases inside the hotspot; n: number of live births inside the hotspot; E: expected SIDS cases inside the hotspot; RR: relative risk calculated as the ratio for inside vs outside the hotspot; LLR: log likelihood ratio of the hotspot

4.2.3 Hotspot detection using existing scanning methods

In this section, we apply existing scanning methods to the SIDS data. With a maximum cluster size of \(K=5,6,7,8,9\), the flexible scan detected the same result as the true hotspot in the case of \(K=5\) (Table 4a). On the other hand, under any of the settings \(K=10,11,12,13,14,15,20\), it detected the same result as the true hotspot cluster obtained at \(K=8\) or \(K=9\) (Table 4d). In the case of \(K=30\), it detected a hotspot consisting of 14 counties with \(\mathrm{LLR}=36.37\) (Table 4g). Furthermore, in the case of \(K=40\), the detected hotspot consisted of 18 counties with \(\mathrm{LLR}=42.16\) (Table 4i). The flexible scan has a performance comparable to the ZDD-based exhaustive scan in terms of detecting a high likelihood cluster; however, its computational load becomes a problem when we want to detect a large cluster. When it was applied to North Carolina’s 100 counties using the FleXScan v3.1.2 software, the execution time was about 2600 s when \(K = 30\) and about 1,645,300 s (about 19 days) when \(K = 40\) (using a Windows 7 PC with an Intel(R) Core(TM) i7 X990 CPU (3.47 GHz) and 24 GB of memory).

However, a question arises here: is it not a problem that the hotspot includes the county of Richmond (county number 77), whose mortality rate is below the average? (The SIDS incidence rate of Richmond (77) is 1.88 per 1000 live births.) The reason why a region with a low rate is included in the hotspot is that the spatial scan statistic, as noted previously, is based on maximizing the LR, and therefore it may absorb neighboring regions even if they do not have an elevated risk. This implies a danger of mistaken identification; that is, we might detect two originally separate hotspots as a single hotspot. Tango’s restricted scan statistic provides one solution to this problem. In applying the restricted statistic, we selected \(K=5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40\) and 50. In addition, we selected the pre-specified significance levels \(\alpha _1=0.10, 0.20, 0.30\) and 0.40. Irrespective of the values of K and \(\alpha _1\), the flexible scan with restriction detected the same counties as the true hotspot cluster obtained with \(K=5\). This is because the low-risk county of Richmond (77) was removed from the scanned counties. Under all conditions we tried, the execution time of the restricted flexible scan was within 1 s.
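The screening step of the restricted statistic can be illustrated with a short Python sketch. It assumes that each region is screened by a one-sided Poisson mid-p value against its expected count and is kept only when that value is below \(\alpha _1\); the exact form of the individual test, the toy counts and the function names are assumptions made here, not details taken from the analysis above.

```python
import math


def poisson_pmf(x, mu):
    """P(X = x) for X ~ Poisson(mu), computed on the log scale."""
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))


def mid_p_upper(obs, mu):
    """One-sided mid-p value: P(X > obs) + 0.5 * P(X = obs)."""
    below = sum(poisson_pmf(x, mu) for x in range(obs))  # P(X < obs)
    point = poisson_pmf(obs, mu)
    return 1.0 - below - 0.5 * point


def eligible_regions(cases, expected, alpha1):
    """Regions whose individual mid-p value is below the screening level."""
    return {r for r in cases if mid_p_upper(cases[r], expected[r]) < alpha1}


# Hypothetical data: five regions with the same expected count.
cases = {0: 10, 1: 2, 2: 10, 3: 2, 4: 2}
expected = {r: 5.2 for r in cases}

if __name__ == "__main__":
    for alpha1 in (0.10, 0.20, 0.30, 0.40):
        print(alpha1, sorted(eligible_regions(cases, expected, alpha1)))
```

Under these assumptions, a low-count region lying between two high-count regions is excluded before scanning at every screening level tried, so it can no longer bridge them into one window, in the same way that Richmond (77) was removed from the scanned counties.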

Finally, we show the results of the echelon scan using the dendrogram based on the mortality rate shown in Fig. 13. We selected the same maximum sizes of hotspot as were used in the restricted flexible method. When \(K=5,6,7,8,9,10,11,12,13,14,15, 20\), the result of the echelon scanning process is shown in Fig. 13a, and the detected counties match the true hotspot in the case of \(K=5\). A great advantage of the echelon scan is that low-risk counties belonging to the bottom of the hierarchical structure are more easily removed from the scanning process. In the echelon-based scan for SIDS in North Carolina, since Richmond (77) was located in the bottom part of the dendrogram compared with some other counties forming the peaks of the dendrogram, this county had a lower scanning priority. As a result, Richmond (77) was not detected as a hotspot county. The scan results when \(K = 30, 40, 50\) are shown in Fig. 13h, j, k, respectively. When K was set to a large value, a hotspot that covered a wide range and had a high likelihood could be detected, even though the time required for these calculations was within 1 s. Details of the detected hotspots of the echelon scan and their maps are shown in Table 4a, h, j, k and Fig. 12a, h, j, k, respectively.
Fig. 13

Echelon dendrogram for SIDS in North Carolina based on the mortality rate per 1000 live births. The county numbers for the detected hotspot are indicated on the dendrogram. a The location of Richmond (77) on the dendrogram and the detected hotspot based on the echelon scan for \(K=5,6,7,8,9,10,11,12,13,14,15,20\); h \(K=30\); j \(K=40\); k \(K=50\)

5 Discussion

In this paper, we proposed a novel approach to detecting a spatial cluster from the perspective of the truly maximum LR using the ZDD technique. As an illustration, we used the SIDS data of North Carolina, which is often used in spatial data analysis. Firstly, we succeeded for the first time in exactly calculating the number of connected regional patterns consisting of 1–100 counties in North Carolina. Then, we could obtain the actual regional patterns under the condition of 11 counties or fewer and detect the true hotspot clusters by calculating the LLR for all of them. This provides an important new insight: for data of the same scale as used in this paper, we can exhaustively search for the cluster with the truly maximum LLR through ZDD enumeration. In addition, we investigated the properties of existing scanning methods, including the flexible and the echelon scans. Table 5 summarizes the LLRs of the detected hotspots for each method as the maximum cluster size K changes. As expected, for the same K, the ZDD-based scan always detected a hotspot with an LLR at least as high as that of any other method. Furthermore, in the cases of \(K = 10\) and \(K = 11\), it was very interesting that the hotspots detected by the ZDD-based scan, unlike previous trends, formed a narrow band of counties running from north to south, as shown in Fig. 12e, f. These uniquely shaped hotspots, which have the best LLR, can never be detected with the other scanning methods discussed in this paper.

For the analysis of about 100 regions, as in this study, we evaluate the hotspot detection properties of the proposed method and the existing methods as follows.
  • Scanning with the ZDD technique can always detect the hotspot cluster with the highest likelihood, and it is ideal for the detection of small hotspots.

  • The original flexible scan works well for the detection of medium-sized hotspots, such as those consisting of 20 regions or fewer, and the detected hotspot has a comparatively high likelihood.

  • The flexible scan with restriction and the echelon scan can detect a hotspot of arbitrary size, without the limitation on the maximum cluster size caused by a high computational load. The former is most suitable for detecting a hotspot that does not include low-risk regions, and the latter is able to obtain a hotspot with high likelihood.

Table 5

LLRs of the detected hotspots for each method as the maximum cluster size K changes, for SIDS in North Carolina

Columns: K; ZDD-based scan (true hotspot); Flexible scan; Flexible scan with restriction (\(\alpha _1=0.10,0.20,0.30,0.40\)); Echelon scan

CD represents computational difficulty

It should be noted that, owing to the nature of the spatial scan statistic, the LR of the detected hotspot may become higher as a result of including a particular region “A” that combines two or more different high-risk clusters into one cluster, even if “A” itself does not have a high risk. We need to decide carefully, in consideration of the background of the data, whether to select “a single hotspot with the maximum LR” or “several separate hotspots with decent LR”.

This paper discussed hotspot cluster detection focusing only on the LR statistic, but of course the significance of the detected hotspot must be judged from the distribution of the statistic. Monte Carlo hypothesis testing (Dwass 1957) is typically used to estimate p values, since it is difficult to obtain the exact distribution of the spatial scan statistic. However, we might be able to determine the true p value using all of the LR statistics calculated from every possible window obtained using ZDD. We consider this to be worthwhile future work.
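The Monte Carlo procedure mentioned above can be sketched as follows. This Python code assumes the usual rank-based estimate, in which the observed statistic is compared with statistics recomputed on data sets simulated under the null hypothesis of cases allocated in proportion to live births. The function names are hypothetical, and `scan_stat` stands for any function that returns the maximum statistic for a given table of case counts.

```python
import random


def simulate_null_cases(rng, births, total_cases):
    """Allocate `total_cases` to regions with probability proportional to births."""
    regions = list(births)
    weights = [births[r] for r in regions]
    counts = {r: 0 for r in regions}
    for r in rng.choices(regions, weights=weights, k=total_cases):
        counts[r] += 1
    return counts


def monte_carlo_p_value(observed_stat, births, total_cases, scan_stat,
                        n_sim=999, seed=0):
    """Rank-based Monte Carlo p value: (1 + #{simulated >= observed}) / (n_sim + 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        simulated = simulate_null_cases(rng, births, total_cases)
        if scan_stat(simulated) >= observed_stat:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


def toy_stat(case_counts):
    """Toy stand-in for a scan statistic: the largest single-region count."""
    return max(case_counts.values())


if __name__ == "__main__":
    births = {r: 1000 for r in range(5)}
    observed = {0: 10, 1: 2, 2: 10, 3: 2, 4: 2}
    p = monte_carlo_p_value(toy_stat(observed), births,
                            sum(observed.values()), toy_stat)
    print(p)
```

With 999 simulations the smallest attainable estimate is 0.001, so the resolution of the estimated p value is limited by the number of replications; this is the limitation that an exact distribution over all windows would remove.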


  1. An induced connected component is a subgraph in which every two vertices of the subgraph have an edge if the edge exists on the original graph.

  2. We conducted this experiment on a machine with an Intel Xeon E5-2630 (2.30 GHz) CPU and 128 GB of memory (Linux CentOS 6.6). We implemented the algorithm in C++ and compiled it using gcc with the -O3 optimization option.



This work was partly supported by JSPS KAKENHI Grant Numbers JP16K16019, JP18K04610, JP18H04091 and JP15H05711.


  1. Anselin, L. (1995). Local indicators of spatial association-LISA. Geographical Analysis, 27(2), 93–115.
  2. Besag, J. E., & Newell, J. (1991). The detection of clusters in rare diseases. Journal of the Royal Statistical Society, Series A, 154(1), 143–155.
  3. Berke, O. (2004). Exploratory disease mapping: Kriging the spatial risk function from regional count data. International Journal of Health Geographics, 3(1), 18.
  4. Cressie, N. (1992). Smoothing regional maps using empirical Bayes predictors. Geographical Analysis, 24(1), 75–95.
  5. Cressie, N., & Chan, N. H. (1989). Spatial modeling of regional variables. Journal of the American Statistical Association, 84, 393–401.
  6. Cuzick, J., & Edwards, R. (1990). Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society, Series B, 52(1), 73–104.
  7. Duczmal, L., & Assunção, R. (2004). A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics and Data Analysis, 45(2), 269–286.
  8. Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics, 28(1), 181–187.
  9. Huang, L., Kulldorff, M., & Gregorio, D. (2007). A spatial scan statistic for survival data. Biometrics, 63(1), 109–118.
  10. Huang, L., Tiwari, R. C., Zuo, Z., Kulldorff, M., & Feuer, E. J. (2009). Weighted normal spatial scan statistic for heterogeneous population data. Journal of the American Statistical Association, 104, 886–898.
  11. Inoue, T., Takano, K., Watanabe, T., Kawahara, J., Yoshinaka, R., Kishimoto, A., et al. (2014). Distribution loss minimization with guaranteed error bound. IEEE Transactions on Smart Grid, 5(1), 102–111.
  12. Ishioka, F., & Kurihara, K. (2012). Detection of spatial clusters using echelon scan. Proceedings of the 20th International Conference on Computational Statistics (COMPSTAT2012), Heidelberg: Physica-Verlag, 341–352.
  13. Ishioka, F., Kurihara, K., Suito, H., Horikawa, Y., & Ono, Y. (2007). Detection of hotspots for 3-dimensional spatial data and its application to environmental pollution data. Journal of Environmental Science for Sustainable Society, 1, 15–24.
  14. Jung, I., Kulldorff, M., & Klassen, A. C. (2007). A spatial scan statistic for ordinal data. Statistics in Medicine, 26(7), 1594–1607.
  15. Jung, I., Kulldorff, M., & Richard, O. J. (2010). A spatial scan statistic for multinomial data. Statistics in Medicine, 29(18), 1910–1918.
  16. Kawahara, J., Inoue, T., Iwashita, H., & Minato, S. (2017a). Frontier-based search for enumerating all constrained subgraphs with compressed representation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E100–A(9), 1773–1784.
  17. Kawahara, J., Saitoh, T., Suzuki, H., & Yoshinaka, R. (2017b). Solving the longest oneway-ticket problem and enumerating letter graphs by augmenting the two representative approaches with ZDDs. In: S. Phon-Amnuaisuk, T.-W. Au, & S. Omar (Eds.), Computational intelligence in information systems: Proceedings of the computational intelligence in information systems conference (CIIS 2016), Cham: Springer, 294–305.
  18. Kawahara, J., Horiyama, T., Hotta, K., & Minato, S. (2017c). Generating all patterns of graph partitions within a disparity bound. In Proceedings of the 11th International Conference and Workshops on Algorithms and Computation (WALCOM2017), 119–131.
  19. Knuth, D. E. (2011). The Art of Computer Programming, Volume 4A, Combinatorial Algorithms, Part 1 (1st ed.). Addison-Wesley Professional.
  20. Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6), 1481–1496.
  21. Kulldorff, M., & Harvard Medical School, Boston and Information Management Services Inc. (2018). SaTScan™ v9.6: Software for the Spatial and Space-Time Scan Statistics. Accessed 1 July 2018.
  22. Kulldorff, M., Huang, L., & Konty, K. (2009). A scan statistic for continuous data based on the normal probability model. International Journal of Health Geographics, 8, 58.
  23. Kulldorff, M., & Nagarwalla, N. (1995). Spatial disease clusters: Detection and inference. Statistics in Medicine, 14(8), 799–810.
  24. Kurihara, K. (2004). Classification of geospatial lattice data and their graphical representation. In D. Banks et al. (Eds), Classification, clustering, and data mining applications (pp. 251–258). New York: Springer.
  25. Lawson, A. B., & Clark, A. (2002). Spatial mixture relative risk models applied to disease mapping. Statistics in Medicine, 21(3), 359–370.
  26. Minato, S. (1993). Zero-suppressed BDDs for set manipulation in combinatorial problems. In Proceedings of the 30th ACM/IEEE Design Automation Conference, 272–277.
  27. Moran, P. A. P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society, Series B, 10(2), 243–251.
  28. Myers, W. L., Patil, G. P., & Joly, K. (1997). Echelon approach to areas of concern in synoptic regional monitoring. Environmental and Ecological Statistics, 4(2), 131–152.
  29. Openshaw, S., Charlton, M., Wymer, C., & Craft, A. (1987). A mark 1 geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information Systems, 1(4), 335–358.
  30. Patil, G. P., & Taillie, C. (2004). Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 11(2), 183–197.
  31. Sekine, K., Imai, H., & Tani, S. (1995). Computing the Tutte polynomial of a graph of moderate size. In Proceedings of the 6th International Symposium on Algorithms and Computation (ISAAC1995), 224–233.
  32. Takahashi, K., Yokoyama, T., & Tango, T. (2010). FleXScan v3.1.2: Software for the Flexible Scan Statistic. National Institute of Public Health Japan. Accessed 1 July 2018.
  33. Takizawa, A., Takechi, Y., Ohta, A., Katoh, N., Inoue, T., Horiyama, T., Kawahara, J., & Minato, S. (2013). Enumeration of region partitioning for evacuation planning based on ZDD. In 11th International Symposium on Operations Research and its Applications in Engineering, Technology and Management 2013 (ISORA 2013), Proceedings of 11th International Symposium, 1–8.
  34. Tango, T. (1995). A class of tests for detecting “general” and “focused” clustering of rare diseases. Statistics in Medicine, 14(21–22), 2323–2334.
  35. Tango, T. (2000). A test for spatial disease clustering adjusted for multiple testing. Statistics in Medicine, 19(2), 191–204.
  36. Tango, T. (2008). A spatial scan statistic with a restricted likelihood ratio. Japanese Journal of Biometrics, 29(2), 75–95.
  37. Tango, T., & Takahashi, K. (2005). A flexible spatial scan statistic for detecting clusters. International Journal of Health Geographics, 4, 11.
  38. Tango, T., & Takahashi, K. (2012). A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Statistics in Medicine, 31(30), 4207–4218.

Copyright information

© Japanese Federation of Statistical Science Associations 2019

Authors and Affiliations

  1. Graduate School of Environmental and Life Science, Okayama University, Okayama, Japan
  2. Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
  3. Laboratory of Advanced Data Science, Information Initiative Center, Hokkaido University, Hokkaido, Japan
  4. Graduate School of Informatics, Kyoto University, Kyoto, Japan
